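I'm considering using Sphinx as the search engine for my website. However, I have a lot of Korean content, and other languages such as Chinese and Thai may show up, so I'm not sure how Sphinx handles this kind of content. Does Sphinx handle content in Asian languages?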
Answers
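I use Sphinx to search CJK characters (Chinese, Japanese, and Korean). All you need to do is add the following lines to the index block in your configuration file.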
index test
{
    ...
    charset_type = utf-8
    ngram_len = 1
    ngram_chars = U+3000..U+2FA1F
}
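To make the effect of `ngram_len = 1` more concrete, here is a rough Python sketch of the idea: every character inside the `ngram_chars` range becomes its own token, while other alphanumeric text is still split into ordinary words. This is only an illustration under assumptions, not Sphinx's actual tokenizer; the names `NGRAM_RANGE`, `is_ngram_char`, and `tokenize` are made up for this example.

```python
from typing import List

# Assumption: the single range taken from the ngram_chars setting above.
NGRAM_RANGE = (0x3000, 0x2FA1F)


def is_ngram_char(ch: str) -> bool:
    """Return True if ch falls inside the configured n-gram range."""
    low, high = NGRAM_RANGE
    return low <= ord(ch) <= high


def tokenize(text: str) -> List[str]:
    """Split text into tokens, emitting each n-gram character on its own.

    Hypothetical helper: it only mimics the observable effect of
    ngram_len = 1 and ignores everything else Sphinx does.
    """
    tokens: List[str] = []
    word: List[str] = []
    for ch in text:
        if is_ngram_char(ch):
            # Flush any pending ordinary word, then emit the 1-gram.
            if word:
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch.lower())
        else:
            # Any other character acts as a separator.
            if word:
                tokens.append("".join(word))
                word = []
    if word:
        tokens.append("".join(word))
    return tokens


print(tokenize("Sphinx 검색 test"))
```

Because each CJK character is indexed as a separate keyword, a query for a multi-character word ends up matching documents that contain those characters, which is why no word segmentation dictionary is needed for this approach.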
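Sphinx works with UTF-8 characters (including Korean, I believe), but you have to include a list of the UTF-8 character codes to index in your Sphinx configuration file.
Here is what the charset_table variable looks like in my Sphinx configuration, with various characters from European languages added: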
charset_table = 0..9, A..Z, U+00C0..U+00DE, U+0100, U+0102, U+0104, U+0106, U+0108, U+010A, U+010C, U+010E, U+0110, U+0112, U+0114, U+0116, U+0118, U+011A, U+011C, U+011E, U+0120, U+0122, U+0124, U+0126, U+0128, U+012A, U+012C, U+012E, U+0130, U+0132, U+0134, U+0136, U+0139, U+013B, U+013D, U+013F, U+0141, U+0143, U+0145, U+0147, U+014A, U+014C, U+014E, U+0150, U+0152, U+0154, U+0156, U+0158, U+015A, U+015C, U+015E, U+0160, U+0162, U+0164, U+0166, U+0168, U+016A, U+016C, U+016E, U+0170, U+0172, U+0174, U+0176, U+0178, U+0179, U+017B, U+017D, a..z, U+00DF..U+00F6, U+00F8..U+00FF, U+0101, U+0103, U+0105, U+0107, U+0109, U+010B, U+010D, U+010F, U+0111, U+0113, U+0115, U+0117, U+0119, U+011B, U+011D, U+011F, U+0121, U+0123, U+0125, U+0127, U+0129, U+012B, U+012D, U+012F, U+0131, U+0133, U+0135, U+0137, U+0138, U+013A, U+013C, U+013E, U+0140, U+0142, U+0144, U+0146, U+0148, U+0149, U+014B, U+014D, U+014F, U+0151, U+0153, U+0155, U+0157, U+0159, U+015B, U+015D, U+015F, U+0161, U+0163, U+0165, U+0167, U+0169, U+016B, U+016D, U+016F, U+0171, U+0173, U+0175, U+0177, U+017A, U+017C, U+017E, U+017F, U+0027
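If you want to check whether a given character is covered by a list like this, a small script can help. The following is a minimal sketch, assuming only the plain `X`, `X..Y`, and `U+XXXX` forms seen in the table above (it does not handle `->` mappings); `parse_codepoint`, `parse_ranges`, and `covers` are hypothetical helper names, and `SAMPLE_TABLE` is an abbreviated subset of the table used purely for illustration.

```python
from typing import List, Tuple


def parse_codepoint(token: str) -> int:
    """Convert 'U+00C0' or a single literal character into a code point."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    if len(token) == 1:
        return ord(token)
    raise ValueError(f"unrecognized code point: {token!r}")


def parse_ranges(spec: str) -> List[Tuple[int, int]]:
    """Parse a comma-separated list of characters and ranges."""
    ranges: List[Tuple[int, int]] = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if ".." in item:
            low, high = item.split("..", 1)
            ranges.append((parse_codepoint(low), parse_codepoint(high)))
        else:
            point = parse_codepoint(item)
            ranges.append((point, point))
    return ranges


def covers(ranges: List[Tuple[int, int]], ch: str) -> bool:
    """Return True if ch falls inside any of the parsed ranges."""
    return any(low <= ord(ch) <= high for low, high in ranges)


# Abbreviated subset of the table above, for illustration only.
SAMPLE_TABLE = "0..9, A..Z, a..z, U+00C0..U+00DE, U+0027"
table = parse_ranges(SAMPLE_TABLE)

print(covers(table, "É"))   # Latin capital E with acute, U+00C9
print(covers(table, "한"))  # Hangul syllable, U+D55C
```

With the sample subset, the accented Latin letter is covered while the Hangul syllable is not, which suggests that Korean text needs its own ranges added before it can be indexed.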
With Thinking Sphinx 3, create a thinking_sphinx.yml file in the config folder and put these lines in it:
development:
  enable_star: 1
  min_infix_len: 3
  ngram_len: 1
  ngram_chars: U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
  charset_table: 0..9, A..Z, U+00C0..U+00DE, U+0100, U+0102, U+0104, U+0106, U+0108, U+010A, U+010C, U+010E, U+0110, U+0112, U+0114, U+0116, U+0118, U+011A, U+011C, U+011E, U+0120, U+0122, U+0124, U+0126, U+0128, U+012A, U+012C, U+012E, U+0130, U+0132, U+0134, U+0136, U+0139, U+013B, U+013D, U+013F, U+0141, U+0143, U+0145, U+0147, U+014A, U+014C, U+014E, U+0150, U+0152, U+0154, U+0156, U+0158, U+015A, U+015C, U+015E, U+0160, U+0162, U+0164, U+0166, U+0168, U+016A, U+016C, U+016E, U+0170, U+0172, U+0174, U+0176, U+0178, U+0179, U+017B, U+017D, a..z, U+00DF..U+00F6, U+00F8..U+00FF, U+0101, U+0103, U+0105, U+0107, U+0109, U+010B, U+010D, U+010F, U+0111, U+0113, U+0115, U+0117, U+0119, U+011B, U+011D, U+011F, U+0121, U+0123, U+0125, U+0127, U+0129, U+012B, U+012D, U+012F, U+0131, U+0133, U+0135, U+0137, U+0138, U+013A, U+013C, U+013E, U+0140, U+0142, U+0144, U+0146, U+0148, U+0149, U+014B, U+014D, U+014F, U+0151, U+0153, U+0155, U+0157, U+0159, U+015B, U+015D, U+015F, U+0161, U+0163, U+0165, U+0167, U+0169, U+016B, U+016D, U+016F, U+0171, U+0173, U+0175, U+0177, U+017A, U+017C, U+017E, U+017F, U+0027
test:
  enable_star: 1
  min_infix_len: 1
production:
  enable_star: 1
  min_infix_len: 3
  ngram_len: 1
  ngram_chars: U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6
  charset_table: 0..9, A..Z, U+00C0..U+00DE, U+0100, U+0102, U+0104, U+0106, U+0108, U+010A, U+010C, U+010E, U+0110, U+0112, U+0114, U+0116, U+0118, U+011A, U+011C, U+011E, U+0120, U+0122, U+0124, U+0126, U+0128, U+012A, U+012C, U+012E, U+0130, U+0132, U+0134, U+0136, U+0139, U+013B, U+013D, U+013F, U+0141, U+0143, U+0145, U+0147, U+014A, U+014C, U+014E, U+0150, U+0152, U+0154, U+0156, U+0158, U+015A, U+015C, U+015E, U+0160, U+0162, U+0164, U+0166, U+0168, U+016A, U+016C, U+016E, U+0170, U+0172, U+0174, U+0176, U+0178, U+0179, U+017B, U+017D, a..z, U+00DF..U+00F6, U+00F8..U+00FF, U+0101, U+0103, U+0105, U+0107, U+0109, U+010B, U+010D, U+010F, U+0111, U+0113, U+0115, U+0117, U+0119, U+011B, U+011D, U+011F, U+0121, U+0123, U+0125, U+0127, U+0129, U+012B, U+012D, U+012F, U+0131, U+0133, U+0135, U+0137, U+0138, U+013A, U+013C, U+013E, U+0140, U+0142, U+0144, U+0146, U+0148, U+0149, U+014B, U+014D, U+014F, U+0151, U+0153, U+0155, U+0157, U+0159, U+015B, U+015D, U+015F, U+0161, U+0163, U+0165, U+0167, U+0169, U+016B, U+016D, U+016F, U+0171, U+0173, U+0175, U+0177, U+017A, U+017C, U+017E, U+017F, U+0027
I found your answer very useful. – vise 2012-03-11 07:40:39