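I installed sphinx-0.9.9 and found that it only really handles English tokenization. A quick search turned up a modified Sphinx that supports Chinese: Coreseek (www.coreseek.cn).
So I installed it to try it out.
Installation: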
wget http://www.coreseek.cn/uploads/csft/3.1/CentOS5/mmseg-3.1-1.i386.rpm
wget http://www.coreseek.cn/uploads/csft/3.1/CentOS5/csft-3.1-1.1.i386.rpm
Installing csft-3.1-1.1.i386.rpm complains about a missing dependency on the PostgreSQL shared library, so install that first:
yum install -y postgresql-libs.i386
rpm -Uvh csft-3.1-1.1.i386.rpm mmseg-3.1-1.i386.rpm
(You also need to download the mmseg source package.)
Import /etc/csft/example.sql, which the package installs, into the database:
mysql < /etc/csft/example.sql
wget http://www.coreseek.cn/uploads/csft/3.1/Source/mmseg-3.1.tar.gz
tar zxvf mmseg-3.1.tar.gz
cd mmseg-3.1/data
mmseg -u unigram.txt
Rename the generated unigram.txt.uni and move it into place:
mv unigram.txt.uni /var/data/dict/uni.lib
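Before going further it can help to confirm that the dictionary actually landed where the config will look for it. A minimal sketch, assuming the /var/data/dict path used above; check_dict is a hypothetical helper name of mine, not something shipped with mmseg or csft:

```shell
#!/bin/sh
# check_dict: hypothetical helper (not part of mmseg or csft).
# Succeeds only if the given directory contains a non-empty uni.lib.
check_dict() {
    dict_dir="$1"
    if [ -s "$dict_dir/uni.lib" ]; then
        echo "ok: $dict_dir/uni.lib"
    else
        echo "missing or empty: $dict_dir/uni.lib" >&2
        return 1
    fi
}

# Check the path used in this post (override with DICT_DIR if yours differs).
check_dict "${DICT_DIR:-/var/data/dict}" || true
```

If it reports the file as missing, the mv step above probably failed, for example because /var/data/dict does not exist yet.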
cd /etc/csft
vi csft.conf
In the index definition, add:
charset_type = zh_cn.utf-8
charset_dictpath = /var/data/dict
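For orientation, here is roughly where those two lines sit inside an index block. This is only a sketch written to a scratch file, not the real /etc/csft/csft.conf: the source name src1 and the path value are my assumptions, modeled on the stock Sphinx sample config, and only the two charset_* lines come from this post.

```shell
#!/bin/sh
# Write an illustrative index block to a scratch file (not to /etc/csft).
# "src1" and the "path" value are assumptions; only the two charset_*
# lines are taken from this post.
sample_conf="${TMPDIR:-/tmp}/csft-index-sample.conf"
cat > "$sample_conf" <<'EOF'
index test1
{
    source           = src1
    path             = /var/data/test1
    charset_type     = zh_cn.utf-8
    charset_dictpath = /var/data/dict
}
EOF

# Count the segmentation-related settings present in the block.
grep -c '^ *charset_' "$sample_conf"
```

The same grep against the real config file is a quick way to confirm that both settings were saved.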
Save, then build the index:
# csft-indexer --all (builds every index defined in the config)
Coreseek Full Text Server 3.1
Copyright (c) 2006-2008 coreseek.com
using config file './csft.conf'...
indexing index 'test1'...
iniparser: cannot open /var/data/dict/mmseg.ini
1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 1,
pt:1, 1; 3, 6,
pt:3, 24; pt:6, 0; 3, 6,
pt:3, 47279; pt:6, 0; 3, 6,
pt:3, 61; pt:6, 0; 3,
pt:3, 13411; pt:3, 30471; pt:6, 1; pt:3, 30471; pt:3, 24; pt:6, 1; pt:3, 24; pt:6, 0; 1,
3, 6,
pt:1, 1; pt:3, 24; pt:6, 1; pt:3, 24; pt:6, 0; 3,
pt:3, 14538; pt:3, 298; 3, 6,
pt:3, 24; pt:6, 0; 3, 6,
pt:3, 24; pt:3, 154; pt:1, 1; pt:6, 0; pt:1, 1; pt:3, 13411; pt:1, 1; pt:3, 13411; pt:3, 13411; pt:3, 30471; 3,
pt:3, 990; pt:3, 1448; pt:6, 1; pt:3, 1448; pt:6, 0;
collected 8 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 8 docs, 269 bytes
total 0.030 sec, 8825.17 bytes/sec, 262.46 docs/sec
Then test Chinese word segmentation:
# csft-search 测试
Coreseek Full Text Server 3.1
Copyright (c) 2006-2008 coreseek.com
using config file '/etc/csft/csft.conf'...
iniparser: cannot open /var/data/dict/mmseg.ini
3, 6,
pt:3, 24; pt:3, 154; pt:6, 0; pt:1, 1;
index 'test1': query '测试 ': returned 4 matches of 4 total in 0.014 sec
displaying matches:
1. document=7, weight=2, group_id=3, date_added=Tue Jan 12 15:37:42 2010
id=7
group_id=3
group_id2=11
date_added=2010-01-12 15:37:42
title=12 测试
content=大册,测试
2. document=5, weight=1, group_id=3, date_added=Tue Jan 12 13:38:04 2010
id=5
group_id=3
group_id2=9
date_added=2010-01-12 13:38:04
title=测试
content=一些
3. document=6, weight=1, group_id=3, date_added=Tue Jan 12 15:26:01 2010
id=6
group_id=3
group_id2=10
date_added=2010-01-12 15:26:01
title=标题
content=我的测试
4. document=8, weight=1, group_id=3, date_added=Tue Jan 12 15:42:25 2010
id=8
group_id=3
group_id2=12
date_added=2010-01-12 15:42:25
title=测试 我的
content=先吃饭
words:
1. '测试': 4 documents, 5 hits
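To turn this manual check into something scriptable, the match count can be pulled out of the search output. A rough sketch, assuming the result line keeps the "returned N matches" wording shown in the run above; match_count is a hypothetical helper of mine, not a csft tool:

```shell
#!/bin/sh
# match_count: hypothetical helper (not part of csft).
# Reads search output on stdin and prints the N from "returned N matches";
# prints nothing if no such line is present.
match_count() {
    LC_ALL=C sed -n 's/.*returned \([0-9][0-9]*\) matches.*/\1/p'
}

# Example using the result line from the run above.
echo "index 'test1': query '测试 ': returned 4 matches of 4 total in 0.014 sec" | match_count
```

On the live system this would be used as `csft-search 测试 | match_count`.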
Supposedly this can greatly speed up ORDER BY and GROUP BY operations. Next time I'll find some data and test that.