用遗传算法识别潜在的模式

文号t53182

16606

用遗传算法识别潜在的模式

epi.clyce2012/12/23软件综合 IP:上海

欢迎大家来我的博客查看原文

[hr]

这几天发现了一道趣题，大致如下：

基因组的表达特征可以表示成类似y=f(x)的函数，在第x组条件下的实验中的某一种表达记为y。
现假设：y在x变化时有着相同或相似变化趋势的属于同一种表达。
科学家们现在确信他们的样本里已经存在一定数量具有相同的表达特征的基因，但由于样本中不同种类基因混杂，实验数据噪声极大，难以提取关键信息。
试写程序，根据他们的实验数据分析该基因的表达特征。

纵观题目，属于模式识别范畴的，然而有两个非常重要的问题：

- 模式自身是未知的，要先通过数据集中各类线索推断出模式

- 数据集中混杂着各类噪声数据，难以直接提取模式

本来打算使用迭代k-means将数据不断二分，之后尝试寻找模式，但是一来K-Means的划分本身并不精确，二来如此迭代速度奇慢，而且在最终结果找出之前并不知道模式出现在哪个数据区域中，就必须一直保持原来的数据，使得数据量很大时程序非常容易溢出。

折腾许久，重新申题，突然就从得到了灵感：

既然题目和我扯生物，那我就以其人之道，还其人之身，以生物办法解之！

就这么愉快的决定了！

首先是问题建模：

问题目的：根据已有的大量基因表达数据解算出样本中包含的一组具有相同表达特征的基因特征

解决方案：从样本中随机选取特定数量的种子基因，设法使他们进化到具有目标表达特征。

接下来开始设计算法，首先设计进化所需的环境：

根据我们的解决方案，我们认为，我们设计的环境应该是：任意基因与目标的表达特征相似度越高，存活几率越大。

由此写下计算基因相似度的函数：

首先我们需要让所有基因在比较的时候站在同一起点上，如[1 3 2 4 5] 和 [5 7 6 8 9]应当能够在比较的时候都化作[0 2 1 3 4]

于是我们可以这样写：

(defn bias [c]
  (map #(- % (first c)) c))

那么我们可以将两个基因的实验数据按位相减标识他们的相异度

(defn ^:int abs [^:int x]
  (if (pos? x) x (- 0 x)))

(defn diff [c1 c2]
  (apply +
         (map #(abs (apply - %))
              (partition 2 (interleave c1 c2)))))

然后除以实验次数以规格化数据，我们称两个基因的规格化相异度为距离

(defn ^:float distance [c1 c2]
  (/ (diff (bias c1) (bias c2))
     (count c1)))

之后我们为该基因对环境的适应性进行评分，因为只有一种模式的存在，所以目标模式一定存在于基因密度最高的地方。
为了拉大评分差距，我们设定一个种子基因x的环境适应程度与该种子基因到周围所有基因c的距离之积成反比。

但这里有一个问题，当数据量过大时，每一个种子基因都必须和所有基因进行一次运算，数据量会很大，相乘后数据值也会非常庞大。

于是我们引入采样率srate的概念，根据采样率的不同从全体基因中随机选取n个基因与其进行比较：

由此，我们可以开始写我们的第一个进化要素：优胜劣汰。

在选取n个种子基因后，我们设定每下一代种子的数量仍旧为n，同时每个种子保留到下一代的几率与其环境适应程度成反比。
那么每个种子在被选中进入下一代的概率应该是（（该种子的环境适应度）/（这一代种子中所有种子的环境适应度之和））
书写如下函数为种子群c中每个种子计算其拥有下一代的概率(c-all为全体基因组成的集合)：

(defn choose-rate [c c-all srate]
  (let [fit #(fitness % c-all srate)
        total-fit (apply + (map fit c))]
    (vec (map #(/ (fit %)
                  total-fit) c))))

采用轮盘赌的方式产生下一代，大致为，计算这一代每个种子的累积概率P(seed[i])=P(seed[0])+P(seed[1])+...+P(seed[i])
之后产生随机数，如果随机数在某两个种子的累计概率之间，择取概率较大的种子加入下一代中。
计算累计概率的函数如下：

(defn total-rate [c c-all srate]
  (->> (let [cr (choose-rate c c-all srate)]
         (loop [n 2 ret [(first cr)]]
           (if (> n (count c)) ret
             (recur (inc n) (conj ret (apply + (take n cr))))))) (interleave c) (partition 2)))

轮盘赌函数如下：

(defn choose-dna [n h]
  (let [r (filter #(> (last %) n) h)] (if (empty? r) (first h) (first r))))

那么就可以轻而易举写出产生下一代的函数：

(defn choose-gen [c c-all srate]
(for [n (range (count c))]
(first (choose-dna (rand) (total-rate c c-all srate)))))

但是这样只能在一代一代中过滤掉不那么优秀的种子基因，却不能让种子基因得到进化。

于是我们再一次求助于大自然，引入有性繁殖的概念。

让种子基因产生变化的算法如下：
对种子群中的种子进行两两配对，使它们产生的下一代中有可能出现彼此部分基因片段发生交换的结果。
比如：[1 3 2 4 5]与[5 7 3 2 8]配对之后，有一定几率产生[1 3 3 2 5] 和 [5 7 2 4 8]，其中第3,4位发生了交换。
我们写如下函数随机将两个种子基因分别分成两部分并使之交换：

(defn swap-dna [c1 c2]
  (let [n (int (rand (count c1)))]
    (->> (map shuffle (partition 2 (interleave (split-at n c1) (split-at n c2))))
      (apply interleave)
      (partition 2)
      (map (partial apply concat)))))

那么产生下一代时两次交换dna头尾，就可以达到有一定几率交换中间片段的效果
我们写如下函数来使一个种子群c中的种子配对并交换dna片段：

(defn swap-gen [c]
  (apply concat
         (map (partial apply swap-dna)
              (map (partial apply swap-dna) (partition 2 c)))))

接下来，为了加速种子的进化，我们还应该使得种子基因在产生下一代的过程中发生变异
在这里我们有一个问题，我们的种子基因并非是2元数据（如 [0 1 0 1 1 1 0]），而是一组有理数，二元数据可以通过随机将某一位置反来产生变异，而一组有理数如果添加随机数，那么所添加的随机数将随着下一代的不断产生而累加，最终得到奇怪的结果。（自然界中的DNA也可以近似为二元数据，因为ATCG是两两严格配对的）。

为了让由于目标基因模式一定存在于样本基因群体的内部，为了让种子的变异结果不至于过度偏离样本中的各类基因，我们重新定义种子的变异过程为：每个种子除了能够与配对种子交换基因片段外，还有一定几率与样本中的其他基因（包括不在种子中的基因）发生基因片段的交换。我们定义在变异概率为vr下种子c在样本c-all中的变异函数如下：

(defn variate [vr c c-all]
  (if (< (rand) vr)      (first (swap-dna c (bias (rand-nth c-all))))     c))

那么在下一代中产生变异种子的函数就是：

(defn var-gen [vr c-all c]   (map #(variate vr % c-all) c))

那么产生下一代种子的整体函数可以写为

(defn next-gen [c c-all vr srate]   (->>
    (choose-gen c c-all srate)
    (swap-gen)
    (var-gen vr c-all)))

如此下来，我们便可以书写进化算法的整体函数，
从“from”样本中选取“seed-count”个种子并进化到第iterate-times代的函数是：

(defn evolve [from & {:keys [seed-count iterate-times variate-rate sample-rate]}]
  (loop [seedp '() seed (map bias (rand-sub from seed-count)) n iterate-times]
    (if (zero? n)
        seed
        (recur (conj seedp seed) (next-gen seed from variate-rate sample-rate) (dec n)))))

事实上我们只需要最终种子群中环境适应度最高的种子，其他可以扔掉了。
evolve函数改为：

(defn evolve [from & {:keys [seed-count iterate-times variate-rate sample-rate]}]
  (loop [seedp '() seed (map bias (rand-sub from seed-count)) n iterate-times]
    (if (zero? n)
      (second (last (sort-by first (map #(vec [(fitness % from sample-rate) %]) seed))))
        (recur (conj seedp seed) (next-gen seed from variate-rate sample-rate) (dec n)))))

整个算法写下来一共只有72行，整理如下：

(defn ^:int abs [^:int x]
  (if (pos? x) x (- 0 x)))

(defn rand-sub [c ^:int n]
  (take n (shuffle c)))

(defn bias [c]
  (map #(- % (first c)) c))

(defn diff [c1 c2]
  (apply +
         (map #(abs (apply - %))
              (partition 2 (interleave c1 c2)))))

(defn ^:float distance [c1 c2]
  (/ (diff (bias c1) (bias c2))
     (count c1)))

(defn ^:float fitness [x c srate]
  (/ 200
     (apply * (map #(+ 0.001 (distance x %))
                   (rand-sub c (int (* srate (count c))))))))

(defn choose-rate [c c-all srate]
  (let [fit #(fitness % c-all srate)
        total-fit (apply + (map fit c))]
    (vec (map #(/ (fit %)
                  total-fit) c))))

(defn total-rate [c c-all srate]
  (->> (let [cr (choose-rate c c-all srate)]
         (loop [n 2 ret [(first cr)]]
           (if (> n (count c)) ret
             (recur (inc n) (conj ret (apply + (take n cr))))))) (interleave c) (partition 2)))

(defn choose-dna [n h]
  (let [r (filter #(> (last %) n) h)] (if (empty? r) (first h) (first r))))

(defn choose-gen [c c-all srate]
  (for [n (range (count c))]
    (first (choose-dna (rand) (total-rate c c-all srate)))))

(defn swap-dna [c1 c2]
  (let [n (int (rand (count c1)))]
    (->> (map shuffle (partition 2 (interleave (split-at n c1) (split-at n c2))))
      (apply interleave)
      (partition 2)
      (map (partial apply concat)))))

(defn swap-gen [c]
  (apply concat
         (map (partial apply swap-dna)
              (map (partial apply swap-dna) (partition 2 c)))))

(defn variate [vr c c-all]
  (if (< (rand) vr)      (first (swap-dna c (bias (rand-nth c-all))))     c)) (defn var-gen [vr c-all c]   (map #(variate vr % c-all) c)) (defn next-gen [c c-all vr srate]   (->>
    (choose-gen c c-all srate)
    (swap-gen)
    (var-gen vr c-all)))

(defn evolve [from & {:keys [seed-count iterate-times variate-rate sample-rate]}]
  (loop [seedp '() seed (map bias (rand-sub from seed-count)) n iterate-times]
    (if (zero? n)
      (second (last (sort-by first (map #(vec [(fitness % from sample-rate) %]) seed))))
        (recur (conj seedp seed) (next-gen seed from variate-rate sample-rate) (dec n)))))

我们使用下面的代码生成测试数据：

(def test-dna-1
  (for [x (range 8)]
    (map #(+ (- (* 2 x) 8)
             (- (+ % (rand 5)) 3))
         [1 7 4 6 8 -4 3 9 5 -5 2 8 6 2 0 7])))

(def test-dna-2
  (for [x (range 60)]
    (for [i (range 16)] (- 15 (rand 30)))))

(def test-dna
  (shuffle (concat test-dna-1 test-dna-2)))

其中[1 7 4 6 8 -4 3 9 5 -5 2 8 6 2 0 7]是生成模式，test-dna-1是根据生成模式添加偏移量和随机量产生8个符合模式的dna

test-dna-2是使用随机数生成的60条干扰dna

test-dna是test-dna-1和test-dna-2放在一起后重新洗牌的结果，程序需要从这68条数据中找出test-dna-1中8条dna所遵循的模式。

书写以下代码产生简单的测试界面（引用seesaw图形界面库），为了测试过程中肉眼能够区分干扰数据和符合模式的数据，程序在绘制时将两种数据以不同的风格区分绘制，但处理的时候依旧用test-dna总数据：

(def main-canvas (canvas :background "#FFFFFF" :paint nil))
(def seeds (spinner :model (spinner-model 6 :from 1 :to 100 :by 1)))
(def vrate (spinner :model (spinner-model 40 :from 0 :to 100 :by 1)))
(def srate (spinner :model (spinner-model 100 :from 0 :to 100 :by 1)))
(def itr (spinner :model (spinner-model 30 :from 1 :to 1000 :by 1)))
(def bview (button :text "View Data"))

(defn draw-lines [gr x n & {:keys [r g b a w]}]
  (draw gr
        (path []
              (move-to (* 60 x) (* 20 (nth n x)))
              (line-to
                (* 60 (inc x))
                (* 20 (nth n (inc x)))))
        (style :foreground (color r g b a)
               :stroke     (stroke :width w))))

(defn draw-graph [c g]
  (let [w (.getWidth c)
        h (.getHeight c)]
    (push g
          (translate g 0 (/ h 2)) (scale g 1 -1)
          (doseq [n test-dna-1]
            (doseq [x (range (dec (count n)))]
              (draw-lines g x n :r 0 :g 0 :b 0 :a 255 :w 1)))
          (doseq [n test-dna-2]
            (let [red (rand-int 255) green (rand-int 255) blue (rand-int 255)]
              (doseq [x (range (dec (count n)))]
                (draw-lines g x n :r red :g green :b blue :a 128 :w 1))))
          (let [n (alg/evolve test-dna
                              :seed-count     (value seeds)
                              :iterate-times  (value itr)
                              :variate-rate   (/ (value vrate) 100)
                              :sample-rate    (/ (value srate) 100))]
            (doseq [x (range (dec (count n)))]
              (draw-lines g x n :r 255 :g 0 :b 0 :a 148 :w 4)))
          (anti-alias g))))

(def test-window
  (frame
    :title "Evolution"
    :content
    (top-bottom-split
      (left-right-split
        (horizontal-panel
          :items [(horizontal-panel :items [(label "Seeds: ") seeds])
                  (horizontal-panel :items [(label " | Variate Rate: ") vrate (label "%")])
                  (horizontal-panel :items [(label " | Sample Rate: ") srate (label "%")])
                  (horizontal-panel :items [(label " | Iterate Count: ") itr])])
        bview
        :divider-location 9/10)
      main-canvas)
    :minimum-size [800 :by 600]
    [s:6]n-close :exit))

(defn -main [& args]
  (invoke-later (-> test-window pack! show!))
  (listen bview :action (fn [e] (config! main-canvas :paint draw-graph))))

之后就可以跑测试了。由于进化算法自身的特性，可能结果并不完全准确，也并非每次都正确。但获得正确结果的几率是非常高的。
下面是本机运行测试的截图：