As an exercise, I wrote a simple Markov-chain implementation in MATLAB. It is basically a letter-based probabilistic model for generating words.
function mc = MC(numStates)
    N = numStates;
    PI = ones(1,N)/N;    %# initial-letter distribution
    TR = ones(N,N)/N;    %# transition matrix: TR(i,j) = P(next=j | current=i)
    mc = struct('logprob',@logprob, 'train',@train, ...
                'sample',@sample, 'sampleFiltered',@sampleFiltered);

    function train(seqs)
        PI = 1 + histc(cellfun(@(c)c(1), seqs)', 1:N);   %# first-letter counts, add-one smoothed
        TR = ones(N,N);
        for i=1:numel(seqs)
            ind = sub2ind([N N], seqs{i}(1:end-1), seqs{i}(2:end));
            TR = TR + reshape(histc(ind,1:N*N), [N N]);  %# letter-pair counts
        end
        PI = bsxfun(@rdivide, PI, sum(PI,2)+(sum(PI,2)==0));  %# normalize rows,
        TR = bsxfun(@rdivide, TR, sum(TR,2)+(sum(TR,2)==0));  %# guarding against division by zero
    end

    function seq = sample(len)
        seq = zeros(1,len);
        seq(1) = randsample(1:N, 1, true, PI);
        for t=2:len
            seq(t) = randsample(1:N, 1, true, TR(seq(t-1),:));
        end
    end

    function seq = sampleFiltered(allowed)
        len = numel(allowed);
        seq = zeros(1,len);
        seq(1) = randsample(allowed, 1, true, PI(allowed));
        allowed( find(allowed==seq(1),1,'first') ) = [];
        for t=2:len-1
            seq(t) = randsample(allowed, 1, true, TR(seq(t-1),allowed));
            allowed( find(allowed==seq(t),1,'first') ) = [];
        end
        seq(len) = allowed;    %# the single remaining letter (was seq(t), which overwrote seq(len-1))
        seq = seq(seq~=0);
    end

    function LL = logprob(seq)
        LL = log(PI(seq(1))) + ...
             sum(log(TR(sub2ind([N N],seq(1:end-1),seq(2:end)))));
    end
end
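For readers without MATLAB, here is a minimal Python sketch of the same idea: add-one (Laplace) smoothed initial and transition counts over letter indices, weighted sampling, and a log-probability score. The class name and the toy two-letter training data are illustrative, not part of the original code.

```python
import math
import random

class LetterMC:
    """First-order Markov chain over letter indices 0..n-1 (sketch of the MATLAB MC)."""
    def __init__(self, n):
        self.n = n
        self.pi = [1.0 / n] * n                      # initial-letter distribution
        self.tr = [[1.0 / n] * n for _ in range(n)]  # row i: P(next letter | current = i)

    def train(self, seqs):
        pi = [1.0] * self.n                          # add-one smoothing, like the
        tr = [[1.0] * self.n for _ in range(self.n)] # 1 + histc(...) counts in MATLAB
        for s in seqs:
            pi[s[0]] += 1
            for a, b in zip(s, s[1:]):
                tr[a][b] += 1
        self.pi = [x / sum(pi) for x in pi]          # normalize to probabilities
        self.tr = [[x / sum(row) for x in row] for row in tr]

    def sample(self, length):
        seq = [random.choices(range(self.n), weights=self.pi)[0]]
        for _ in range(length - 1):
            seq.append(random.choices(range(self.n), weights=self.tr[seq[-1]])[0])
        return seq

    def logprob(self, seq):
        lp = math.log(self.pi[seq[0]])
        for a, b in zip(seq, seq[1:]):
            lp += math.log(self.tr[a][b])
        return lp

# toy alphabet {a=0, b=1}; training words "ab", "ab", "ba"
mc = LetterMC(2)
mc.train([[0, 1], [0, 1], [1, 0]])
word = mc.sample(5)
```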
We need some text to train the model on. We use "The Wonderful Wizard of Oz" from Project Gutenberg.
%# read the text document
str = lower(urlread('http://www.gutenberg.org/files/55/55.txt'));
SP = char(32);                       %# delimiter (space)
str(~isstrprop(str, 'alpha')) = SP;  %# replace non-letters with spaces
str(strfind(str, [SP SP])) = [];     %# collapse consecutive spaces into one
idx = (str == SP);                   %# location of spaces
df = diff([1 idx 1]);
len = find(df > 0) - find(df < 0);   %# length of each word
[seqs gn] = grp2idx(str(~idx)');     %# map letters to numbers starting from 1
seqs = mat2cell(seqs', 1, len)';     %# put each word in a separate cell
N = length(gn);                      %# alphabet size (a to z)
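The same preprocessing pipeline (lowercase, treat non-letters as delimiters, map each letter to an integer state) can be expressed compactly in Python; the sample sentence below is a stand-in for the downloaded book text:

```python
import re

text = "Dorothy lived in the midst of the great Kansas prairies."  # stand-in for the book
words = re.findall(r"[a-z]+", text.lower())      # keep alphabetic runs only
letters = sorted(set("".join(words)))            # the observed alphabet, like gn
index = {ch: i for i, ch in enumerate(letters)}  # letter -> state (0-based here, 1-based in MATLAB)
seqs = [[index[ch] for ch in w] for w in words]  # one integer sequence per word
```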
Finally, we use the model either to sample random words or to sample words from a given set of letters:
%# train Markov chain
mc = MC(N);
mc.train(seqs);
%# sample a random word
seq = mc.sample(randi([3 10]));
fprintf('word = %s , logP(word)=%f\n', [gn{seq}], mc.logprob(seq))
%# sample a word from a set of letters
letters = lower('markovchains');
lettersIdx = cellfun(@(c) find(strcmp(c,gn)), cellstr(letters'));  %# map letters to state indices
seq = mc.sampleFiltered(lettersIdx);
fprintf('word = %s , logP(word)=%f\n', [gn{seq}], mc.logprob(seq))
Here are a bunch of examples generated from the letters "markovchains", along with each word's log-probability under the model:
word = mivorancask , logP(word)=-29.610819
word = arknoamshiv , logP(word)=-32.496090
word = ancoramshik , logP(word)=-29.299897
word = orchisankav , logP(word)=-29.987204
word = avinchasorm , logP(word)=-27.178507
word = aronchaskim , logP(word)=-25.651964
You can see that although none of them are correct words, they are still better than a purely random sequence of letters. Obviously, using only the previous character to generate the next is not enough, but the model can easily be extended to more sophisticated cases (N-grams).
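The N-gram extension mentioned above simply widens the state to the previous k letters. A small Python sketch of the counting side, where the training words are invented and the `^`/`$` padding characters are assumed start/end markers:

```python
from collections import defaultdict
import random

def train_ngram(words, k=2):
    """Count transitions from the previous k letters to the next letter."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        padded = "^" * k + w + "$"          # ^ pads the start, $ marks the end
        for i in range(len(padded) - k):
            counts[padded[i:i+k]][padded[i+k]] += 1
    return counts

def sample_word(counts, k=2, max_len=20):
    state, out = "^" * k, ""
    while len(out) < max_len:
        nxt = counts[state]                 # distribution over next letters
        ch = random.choices(list(nxt), weights=nxt.values())[0]
        if ch == "$":                       # end-of-word marker drawn: stop
            break
        out += ch
        state = state[1:] + ch              # slide the k-letter context window
    return out

counts = train_ngram(["marble", "market", "martian"])  # invented training words
w = sample_word(counts)
```

With k=2 and such a tiny corpus the chain can only retrace observed contexts, which is why larger k needs much more training text.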
The nice thing about this approach is that it is not restricted to one language: it can be adapted to any other language simply by feeding it documents in the language of your choice.
I see... you're building an automated curse generator: http://thedailywtf.com/Articles/The-Automated-Curse-Generator.aspx ;) – Guffa 2010-04-16 06:29:26
Haha, not quite, but that's a good heads-up. It's for a family-safe game, so I guess it will need a filter. – 2010-04-16 07:06:59
Note that the filter will never be good enough. Anything you write will turn into an impressive obscenity under somebody's interpretation. – 2010-07-23 18:03:36