Elasticsearch Highlighting: Term Vectors


优雅殿下 2022-03-15 07:56:38


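1. What is a term vector?

A term vector is generated by Elasticsearch when a document is indexed. It holds information about the terms produced while the document is analyzed, such as each term's position in the field value, its start and end character offsets, and its payloads (term metadata).

Term vectors are stored separately and roughly double the space a field takes up, so Elasticsearch disables them by default. To enable them, set term_vector in the field's mapping.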

2. term_vector configuration options

term_vector supports the following options:

Option	Description
no	Term vectors are not stored (default)
yes	Term vectors are stored, recording only the terms
with_positions	Terms and their positions in the field value
with_offsets	Terms and their start and end character offsets
with_positions_offsets	Terms, positions, and start and end character offsets
with_positions_payloads	Terms, positions, and payloads
with_positions_offsets_payloads	Terms, positions, start and end character offsets, and payloads
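
The option names in the table are compositional: whatever follows with_ lists what is stored in addition to the terms themselves. A minimal Python sketch of that reading is shown here; recorded_details is a hypothetical helper written for illustration and is not part of any Elasticsearch client.

```python
# Hypothetical helper: decode a term_vector option name into the details it records.
VALID_OPTIONS = {
    "no", "yes", "with_positions", "with_offsets",
    "with_positions_offsets", "with_positions_payloads",
    "with_positions_offsets_payloads",
}

def recorded_details(option):
    """Return the set of details stored for a given term_vector option."""
    if option not in VALID_OPTIONS:
        raise ValueError(f"unknown term_vector option: {option!r}")
    if option == "no":
        return set()
    details = {"terms"}  # every enabled option records the terms
    if option.startswith("with_"):
        details.update(option[len("with_"):].split("_"))
    return details

print(sorted(recorded_details("with_positions_offsets")))
# ['offsets', 'positions', 'terms']
```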

The following mapping enables term vectors on the text and fullname fields:

PUT /term_vector_test/
{
    "mappings": {
        "_doc": {
            "properties": {
                "text": {
                    "type": "text",
                    "term_vector": "with_positions_offsets_payloads",
                    "store": true,
                    "analyzer": "standard"
                },
                "fullname": {
                    "type": "text",
                    "term_vector": "with_positions_offsets_payloads",
                    "analyzer": "standard"
                }
            }
        }
    },
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        }
    }
}

Index the following two documents:

PUT /term_vector_test/_doc/1
{
  "fullname" : "John Doe",
  "text" : "twitter test test test "
}

PUT /term_vector_test/_doc/2
{
  "fullname" : "Jane Doe",
  "text" : "Another twitter test ..."
}

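3. Inspecting the term vector data structure

Elasticsearch provides the _termvectors API, which we can use to inspect the term vectors generated for the documents we just indexed.

Each call returns the term vector information for a single document, identified by the _id in the URL.

A term vector is made up of term information, term statistics, and field statistics. Term information is further split into positions, offsets, and payloads. Parameters in the request body control which of these are returned.

Below we request the term vector of the text field for the document with id=1: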

GET /term_vector_test/_doc/1/_termvectors
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}
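
For readers who assemble such requests from code, the sketch below builds the path and JSON body without sending anything. build_termvectors_request is a hypothetical helper, and the typed /index/_doc/id/_termvectors path simply mirrors the form used in this post; other Elasticsearch versions may expect a different path.

```python
import json

def build_termvectors_request(index, doc_id, fields, doc_type="_doc"):
    """Return (method, path, body) for a _termvectors call; nothing is sent."""
    # Path layout copied from the request in this post (an assumption for other versions).
    path = f"/{index}/{doc_type}/{doc_id}/_termvectors"
    body = {
        "fields": list(fields),
        "offsets": True,
        "payloads": True,
        "positions": True,
        "term_statistics": True,
        "field_statistics": True,
    }
    return "GET", path, json.dumps(body, indent=2)

method, path, body = build_termvectors_request("term_vector_test", 1, ["text"])
print(method, path)
print(body)
```

Setting any of the boolean flags to False in the body would ask Elasticsearch to omit that part of the response.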

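The response shows that a term vector consists of three parts:

Term information

term position: the term's position within the field value. The term test occupies positions 1, 2, and 3, while twitter occupies position 0.

start and end offsets: the character offsets at which the term starts and ends in the field value. For twitter, start_offset and end_offset are 0 and 7.

term payloads: the term's metadata. Every token's payload here is d29yZA==, the Base64 encoding of the string word, which suggests that Elasticsearch's default value is word.

term frequency: how many times the term occurs in the field value. For twitter, term_freq is 1.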
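
The positions, offsets, and payload described above can be reproduced by hand. The sketch below assumes plain whitespace tokenization, which is enough to match the numbers for document 1 but is not necessarily what the configured analyzer does; whitespace_tokens is a hypothetical helper.

```python
import base64
import re

def whitespace_tokens(text):
    """Yield (term, position, start_offset, end_offset) for whitespace-separated tokens."""
    for position, match in enumerate(re.finditer(r"\S+", text)):
        yield match.group(), position, match.start(), match.end()

tokens = list(whitespace_tokens("twitter test test test "))
for token in tokens:
    print(token)
# ('twitter', 0, 0, 7)
# ('test', 1, 8, 12)
# ('test', 2, 13, 17)
# ('test', 3, 18, 22)

# term_freq is the number of tokens for a term within this one field value.
term_freq = sum(1 for term, *_ in tokens if term == "test")

# Payloads are returned Base64-encoded.
payload = base64.b64decode("d29yZA==").decode("utf-8")
print(term_freq, payload)  # 3 word
```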

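Term statistics

total term frequency: how many times the term occurs in this field across all documents. Here ttf is 2 for twitter and 4 for test.

document frequency: the number of documents whose field contains the term. Both documents contain test and twitter in the text field, so doc_freq is 2 for each.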
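
As a rough check, both term statistics can be recomputed from the two sample documents. This sketch assumes simple whitespace splitting rather than the real analyzer, and the DOCS dictionary is just a local copy of the text field values.

```python
from collections import Counter

# Local copy of the text field of the two indexed documents.
DOCS = {
    "1": "twitter test test test ",
    "2": "Another twitter test ...",
}

ttf = Counter()       # total occurrences of each term across all documents
doc_freq = Counter()  # number of documents that contain each term
for text in DOCS.values():
    terms = text.split()
    ttf.update(terms)            # count every occurrence
    doc_freq.update(set(terms))  # count each term once per document

for term in ("twitter", "test"):
    print(term, "ttf =", ttf[term], "doc_freq =", doc_freq[term])
# twitter ttf = 2 doc_freq = 2
# test ttf = 4 doc_freq = 2
```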

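Field statistics

document count: the number of documents that contain this field. Both documents have a text field, so doc_count is 2.

sum of document frequencies: the sum of the document frequencies of all terms in this field. The calculation below gives sum_doc_freq = 6: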

\[df_{sum}(text) = df(test) + df(twitter) + df(another) + df(...) = 2 + 2 + 1 + 1 = 6\]

sum of total term frequencies: the sum of the total term frequencies of all terms in this field. The calculation below gives sum_ttf = 8:

\[ttf_{sum}(text) = ttf(test) + ttf(twitter) + ttf(another) + ttf(...) = 4 + 2 + 1 + 1 = 8\]
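
The three field statistics can be checked the same way. The sketch assumes whitespace splitting with "..." kept as a term, which is the tokenization implied by the sums above; the real analyzer may behave differently.

```python
from collections import Counter

docs = ["twitter test test test ", "Another twitter test ..."]
tokenized = [text.split() for text in docs]

# doc_count: documents that have at least one term in the field.
doc_count = sum(1 for terms in tokenized if terms)

# Per-term document frequency and total term frequency.
doc_freq = Counter(term for terms in tokenized for term in set(terms))
ttf = Counter(term for terms in tokenized for term in terms)

sum_doc_freq = sum(doc_freq.values())
sum_ttf = sum(ttf.values())
print(doc_count, sum_doc_freq, sum_ttf)  # 2 6 8
```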

{
  "_index" : "term_vector_test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 0,
  "term_vectors" : {
    "text" : {
      "field_statistics" : {
        "sum_doc_freq" : 6,
        "doc_count" : 2,
        "sum_ttf" : 8
      },
      "terms" : {
        "test" : {
          "doc_freq" : 2,
          "ttf" : 4,
          "term_freq" : 3,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 8,
              "end_offset" : 12,
              "payload" : "d29yZA=="
            },
            {
              "position" : 2,
              "start_offset" : 13,
              "end_offset" : 17,
              "payload" : "d29yZA=="
            },
            {
              "position" : 3,
              "start_offset" : 18,
              "end_offset" : 22,
              "payload" : "d29yZA=="
            }
          ]
        },
        "twitter" : {
          "doc_freq" : 2,
          "ttf" : 2,
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 7,
              "payload" : "d29yZA=="
            }
          ]
        }
      }
    }
  }
}
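
Since this post is about highlighting, one way to see why the offsets matter is to use them to slice the original text. The sketch below works on a trimmed copy of the response above; term_spans is a hypothetical helper, and this is only an illustration of the idea, not how Elasticsearch's highlighters are implemented.

```python
import json

# Trimmed copy of the response above (only the fields used here).
response = json.loads("""
{
  "term_vectors": {
    "text": {
      "terms": {
        "test": {"term_freq": 3, "tokens": [
          {"position": 1, "start_offset": 8, "end_offset": 12},
          {"position": 2, "start_offset": 13, "end_offset": 17},
          {"position": 3, "start_offset": 18, "end_offset": 22}]},
        "twitter": {"term_freq": 1, "tokens": [
          {"position": 0, "start_offset": 0, "end_offset": 7}]}
      }
    }
  }
}
""")

def term_spans(response, field):
    """Map each term to its list of (start_offset, end_offset) pairs."""
    terms = response["term_vectors"][field]["terms"]
    return {
        term: [(t["start_offset"], t["end_offset"]) for t in info["tokens"]]
        for term, info in terms.items()
    }

spans = term_spans(response, "text")
source = "twitter test test test "
for start, end in spans["test"]:
    # Each offset pair cuts the matching term out of the stored field value.
    print(source[start:end], start, end)
```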

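Term statistics and field statistics are not exact, for two reasons:

Deleted documents are not taken into account.

Only the data on the shard that holds the requested document is counted.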

posted @ 2022-03-15 07:29 无风听海
