Building a full-text search engine in Python


    Recently I have been exploring how to implement keyword search in Python, along the lines of what Baidu offers. Keyword search immediately brings regular expressions to mind: they are the foundation of most text matching, and Python has the re module dedicated to regex matching. Regular expressions alone, however, are not enough to build a proper search feature.

    Python has a package called whoosh that is dedicated to full-text search.
    whoosh is not widely used in China, and it is not as mature or as fast as sphinx/coreseek; unlike those, however, it is a pure-Python library, which makes it much more convenient for Python users. The code is given below.

    Installation

    Run the following at the command line: pip install whoosh
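    The tokenizer below also relies on jieba for Chinese word segmentation, so that package has to be installed as well:

    pip install jieba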

    The packages that need to be imported are:

    from whoosh.index import create_in, open_dir
    from whoosh.fields import *
    from whoosh.analysis import RegexAnalyzer
    from whoosh.analysis import Tokenizer, Token
    from whoosh.compat import text_type   # used by the tokenizer's type check
    import jieba                           # Chinese word segmentation
    import os
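    With whoosh installed, the basic flow is always the same: define a Schema, create an index directory, add documents through a writer, then run queries through a searcher. The short sketch below is only an illustration of that flow with whoosh's default analyzer and a throwaway demo_index directory; the Chinese-specific pieces are added in the following sections.

    import os
    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT
    from whoosh.qparser import QueryParser

    os.makedirs("demo_index", exist_ok=True)             # whoosh needs an existing directory
    schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True))
    ix = create_in("demo_index", schema)                  # create the index in demo_index
    writer = ix.writer()
    writer.add_document(title=u"hello", content=u"whoosh is a pure Python search library")
    writer.commit()

    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse(u"python")
        for hit in searcher.search(query):
            print(hit["title"])                           # -> hello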
    

    Chinese tokenizer

    class ChineseTokenizer(Tokenizer):
        """
        Chinese tokenizer: segments text with jieba so whoosh can index Chinese.
        """
        def __call__(self, value, positions=False, chars=False,
                     keeporiginal=True, removestops=True, start_pos=0, start_char=0,
                     mode='', **kwargs):
            assert isinstance(value, text_type), "%r is not unicode" % value
            t = Token(positions, chars, removestops=removestops, mode=mode, **kwargs)
            # jieba's search mode emits finer-grained segments, which suits indexing
            list_seg = jieba.cut_for_search(value)
            for w in list_seg:
                t.original = t.text = w
                t.boost = 0.5
                if positions:
                    # note: value.find(w) only locates the first occurrence of w
                    t.pos = start_pos + value.find(w)
                if chars:
                    t.startchar = start_char + value.find(w)
                    t.endchar = start_char + value.find(w) + len(w)
                yield t


    def chinese_analyzer():
        return ChineseTokenizer()
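    As a quick check, the analyzer can be called directly on a sample sentence (the sentence here is only an illustration); with positions and chars enabled it yields each segment together with its offsets:

    analyzer = chinese_analyzer()
    for token in analyzer(u"全文检索引擎用Python实现", positions=True, chars=True):
        print(token.text, token.pos, token.startchar, token.endchar)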

    The index-building function

    def create_index(document_dir):
        analyzer = chinese_analyzer()
        schema = Schema(title=TEXT(stored=True, analyzer=analyzer),
                        path=ID(stored=True),
                        content=TEXT(stored=True, analyzer=analyzer))
        ix = create_in("./", schema)                      # store the index in the current directory
        writer = ix.writer()
        for parents, dirnames, filenames in os.walk(document_dir):
            for filename in filenames:
                title = filename.replace(".txt", "")
                print(title)
                with open(os.path.join(parents, filename), 'r', encoding='utf-8') as f:
                    content = f.read()
                path = u"/b"
                writer.add_document(title=title, path=path, content=content)
        writer.commit()
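    Assuming the .txt documents live in a folder such as documents (the path is only an example), building the index is a single call:

    create_index("./documents")    # index every .txt file under ./documents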

    The search function

    def search(search_str):
        title_list = []
        ix = open_dir("./")                               # open the index built by create_index
        searcher = ix.searcher()
        print(search_str, type(search_str))
        # parse search_str against the "content" field and run the query
        results = searcher.find("content", search_str)
        for hit in results:
            print(hit['title'])
            print(hit.score)
            print(hit.highlights("content", top=10))      # highlighted matching fragments
            title_list.append(hit['title'])
        return title_list
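    Putting the two functions together (the directory name and the query string are only illustrations):

    create_index("./documents")
    matching_titles = search(u"全文检索")   # titles of documents whose content matches the query
    print(matching_titles)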
