Python+Lucene

Published: 2019-08-31 09:55:58  Editor: auto

    Installing and configuring Python + Lucene (PyLucene) + Paoding


    PyLucene lets Python call the Lucene API to implement search. The project closely tracks Lucene releases, which is good news for anyone used to working in Python.

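    PyLucene is built with JCC. JCC reads the public class and method signatures in a jar file and generates C++ wrapper classes, which call the Java classes and methods through JNI (Java Native Interface). The C++ code is then compiled into a Python extension module, and a JVM is embedded in the Python interpreter so the module can be used. For details, see http://lucene.apache.org/pylucene/jcc/documentation/readme.html .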
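
    To make the first step above concrete, here is a minimal sketch that enumerates the classes packaged in a jar, using only the Python standard library. It is not how JCC is implemented: JCC also inspects method signatures and generates C++ code, which this sketch does not attempt. The function names and the demo archive entries are made up for illustration.

```python
import io
import zipfile


def list_jar_classes(jar_file):
    """Return the fully qualified names of the top-level classes in a jar.

    A jar is a zip archive in which each class is stored as an entry such as
    some/package/ClassName.class.
    """
    names = []
    with zipfile.ZipFile(jar_file) as jar:
        for entry in jar.namelist():
            # Skip resources and nested or anonymous classes (Outer$Inner.class).
            if entry.endswith(".class") and "$" not in entry:
                names.append(entry[:-len(".class")].replace("/", "."))
    return sorted(names)


def make_demo_jar():
    """Build a tiny in-memory archive with hypothetical entries for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        jar.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
        jar.writestr("demo/search/Analyzer.class", b"")
        jar.writestr("demo/search/Analyzer$1.class", b"")
        jar.writestr("demo/index/Writer.class", b"")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name in list_jar_classes(make_demo_jar()):
        print(name)
```

    Passing the path of a real jar, for example the Paoding jar used later in this article, in place of the demo archive lists the classes that would be candidates for wrapping.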

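    Paoding's interface matches that of Lucene versions before 2.9, so I picked the closest PyLucene release (pylucene 2.4). The JCC bundled with that release is rather old, however, so I used the JCC from pylucene 3.3 instead.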

    The rest of this article assumes that Python 2.7.2 is installed in /data/python-2.7.2 and that the source archives are kept in /data/src.


    1 Install Python

    Download Python 2.7.2
    Change to the extracted directory
    ./configure --prefix=/data/python-2.7.2 --enable-shared
    make && make install
    export LD_LIBRARY_PATH=/data/python-2.7.2/lib
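
    The Makefile in section 3 runs JCC with --shared, which is why Python is configured with --enable-shared here. As a quick sanity check, the following sketch reports whether the running interpreter was built that way. It relies on the standard sysconfig module; the helper name python_build_info is made up, and the config variables may be missing on non-Unix builds.

```python
import sysconfig


def python_build_info():
    """Collect the build settings relevant to a shared-library build."""
    return {
        # Truthy when the interpreter was configured with --enable-shared.
        "enable_shared": bool(sysconfig.get_config_var("Py_ENABLE_SHARED")),
        # Library directory recorded at build time (may be None on some builds).
        "libdir": sysconfig.get_config_var("LIBDIR"),
    }


if __name__ == "__main__":
    info = python_build_info()
    print("built with --enable-shared: %s" % info["enable_shared"])
    print("library directory: %s" % info["libdir"])
```

    Run it with /data/python-2.7.2/bin/python. If the library directory it prints is not on the loader's search path, the LD_LIBRARY_PATH export above is what makes it visible.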

    Install the setuptools package

    wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz#md5=7df2a529a074f613b509fb44feefe74e
    tar zxvf setuptools-0.6c11.tar.gz
    cd setuptools-0.6c11
    /data/python-2.7.2/bin/python setup.py install


    2 Install JCC 2.10

    Download pylucene-3.3-3-src.tar.gz
    Change to the extracted directory
    cd jcc

    Patch setuptools
    mkdir tmp
    cd tmp
    unzip -q /data/python-2.7.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg
    patch -Nup0 < /data/src/pylucene-3.3-3/jcc/jcc/patches/patch.43.0.6c11
    sudo zip /data/python-2.7.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg -f
    cd ..
    rm -rf tmp

    ln -sf /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64 /usr/lib/jvm/java-6-openjdk
    /data/python-2.7.2/bin/python setup.py build
    /data/python-2.7.2/bin/python setup.py install
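
    Before moving on, it can help to confirm that the freshly installed modules are importable by the interpreter under /data/python-2.7.2. The sketch below is a generic check, not part of JCC. The helper name module_available is made up, and the module name jcc is taken from the "python -m jcc" invocation in the Makefile in section 3.

```python
import importlib


def module_available(name):
    """Return True if the named module can be imported by this interpreter."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    for name in ("setuptools", "jcc"):
        status = "ok" if module_available(name) else "MISSING"
        print("%-12s %s" % (name, status))
```

    The same check can be repeated for the lucene module once section 3 is finished.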


    3 Install PyLucene + Paoding

    Download pylucene-2.4.1-2-src.tar.gz and paoding-analysis-2.0.4-beta.zip
    tar zxvf pylucene-2.4.1-2-src.tar.gz
    mkdir paoding
    cd paoding
    unzip ../paoding-analysis-2.0.4-beta.zip

    Change to the extracted pylucene-2.4.1-2 directory
    vi Makefile and make the following changes
    ...
    # Linux (Ubuntu 8.10 64-bit, Python 2.5.2, OpenJDK 1.6, setuptools 0.6c9)
    PREFIX_PYTHON=/data/python-2.7.2
    ANT=ant
    PYTHON=$(PREFIX_PYTHON)/bin/python
    JCC=$(PYTHON) -m jcc --shared
    NUM_FILES=2
    ...
    JARS=$(LUCENE_JAR) $(SNOWBALL_JAR) $(HIGHLIGHTER_JAR) $(ANALYZERS_JAR) \
    $(REGEX_JAR) $(QUERIES_JAR) $(INSTANTIATED_JAR) $(EXTENSIONS_JAR) \
    /data/src/paoding/paoding-analysis.jar
    ...
    GENERATE=$(JCC) $(foreach jar,$(JARS),--jar $(jar)) \
    --include /data/src/paoding/lib/commons-logging.jar \
    --package java.lang java.lang.System \
    ...
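
    The $(foreach ...) expression in GENERATE turns every entry in JARS into its own --jar argument, and the added --include line passes the commons-logging jar that Paoding depends on. The following sketch mirrors that expansion in Python to show the shape of the resulting command line. The function jcc_command is hypothetical, it covers only the flags visible in the excerpt above, and lucene-core.jar stands in for the Lucene jars that the Makefile elides.

```python
def jcc_command(python, jars, includes, shared=True):
    """Assemble a JCC argument list the way the Makefile excerpt does."""
    cmd = [python, "-m", "jcc"]
    if shared:
        cmd.append("--shared")
    # Equivalent of $(foreach jar,$(JARS),--jar $(jar)).
    for jar in jars:
        cmd += ["--jar", jar]
    # Jars that are needed on the classpath but are passed with --include.
    for include in includes:
        cmd += ["--include", include]
    return cmd


if __name__ == "__main__":
    cmd = jcc_command(
        "/data/python-2.7.2/bin/python",
        ["lucene-core.jar", "/data/src/paoding/paoding-analysis.jar"],
        ["/data/src/paoding/lib/commons-logging.jar"],
    )
    print(" ".join(cmd))
```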

    Run

    make
    make install

    4 Test

    export LD_LIBRARY_PATH=/data/python-2.7.2/lib
    export PAODING_DIC_HOME=/data/src/paoding/dic
    /data/python-2.7.2/bin/python /data/src/testpylucene.py
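
    The test run depends on the two environment variables exported above. A small pre-flight check such as the sketch below can turn a missing variable into a readable message before the test script starts. The helper check_paoding_env is made up for illustration and only checks that the variables are set and that the dictionary path is a directory.

```python
import os


def check_paoding_env(environ):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    dic_home = environ.get("PAODING_DIC_HOME")
    if not dic_home:
        problems.append("PAODING_DIC_HOME is not set")
    elif not os.path.isdir(dic_home):
        problems.append("PAODING_DIC_HOME is not a directory: %s" % dic_home)
    if not environ.get("LD_LIBRARY_PATH"):
        problems.append("LD_LIBRARY_PATH is not set")
    return problems


if __name__ == "__main__":
    found = check_paoding_env(os.environ)
    for problem in found:
        print(problem)
    if not found:
        print("environment looks fine")
```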

    The contents of testpylucene.py are as follows:
    # -*- coding: utf-8 -*-
    #
    from lucene import *

    texts = ["Python是一个很有吸引力的语言",
    "C++语言也很有吸引力,长久不衰",
    "我们希望Python和C++高手加入",
    "我们的技术巨牛,人人都是高手"]

    def search(searcher, qtext):
        tq = TermQuery(Term("content", qtext))
        hits = searcher.search(tq)
        print "----------------------------------------------"
        print "Query:'%s', %d Found" % (qtext,hits.length())
        for i in range(hits.length()):
            doc = hits.doc(i)
            print "\t",doc.get("content")

    def dump(reader):
        for i in range(reader.maxDoc()):
            print "-----------------------------------------------"
            tv = reader.getTermFreqVector(i, "content")
            for tk in tv.getTerms():
                print tk

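    initVM()  # start the embedded JVM before touching any Lucene class
    directory = RAMDirectory()  # keep the index in memory
    analyzer = PaodingAnalyzer()
    writer = IndexWriter(directory, analyzer, True)  # True creates a new index
    for text in texts:
        doc = Document()
        # Store the text, tokenize it, and keep term vectors for dump()
        doc.add(Field("content", text, Field.Store.YES, Field.Index.TOKENIZED,
            Field.TermVector.YES))
        writer.addDocument(doc)
    writer.optimize()
    writer.close()
    reader = IndexReader.open(directory)
    dump(reader)  # print the terms the analyzer produced for each document
    searcher = IndexSearcher(directory)
    search(searcher, "python")
    search(searcher, "C++")
    search(searcher, "高手")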
