news 2025/7/4 21:48:41

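Elasticsearch Storage and Querying

In search systems, storing data and querying it are the two most basic and most critical stages. Elasticsearch (ES) is heavily optimized for both, which makes it stand out among relational and non-relational databases as the first choice for building search systems.

Mapping

- A mapping is metadata in ES that describes the structure and types of the data in documents. It can be inferred automatically when documents are first indexed (dynamic mapping) or defined by hand. For each field, a mapping records its name, its type, whether it is searchable, whether it is analyzed, and so on. Mappings are defined per index and apply to every document in that index.
- ES implements dynamic mapping. Suppose the following document is written to an index.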

{
  "name": "jack",
  "age": 18,
  "birthDate": "1991-10-05"
}

With dynamic mapping, name is mapped to text (with a keyword sub-field), age to long, and birthDate to date. The detection rules are as follows.

JSON value	Field type
true, false	boolean
123, 456, 876	long
123.43, 234.534	float
String that looks like a date, e.g. "2022-05-15"	date
Any other string, e.g. "Hello Elasticsearch"	text (with a keyword sub-field)
Array	the type of the first non-null element in the array
JSON object	object; each property is mapped according to its own content

The types below are not inferred by dynamic mapping and have to be declared explicitly in the mapping.

Field content	Field type
IPv4 or IPv6 address	ip
Base64-encoded string	binary
Array of objects whose inner objects must be queried independently	nested
Latitude/longitude pair	geo_point
Coordinates of an arbitrary geometric shape (lines, polygons and so on)	geo_shape
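
The detection rules can be approximated in a few lines of Python. This is only a rough model for reasoning about what a document will turn into, not how ES implements it: the `infer_field_type` helper and the `ISO_DATE` pattern are made up for this sketch, date detection is reduced to plain `yyyy-MM-dd` strings, and floating-point numbers are reported as float, which is what ES picks under its default dynamic setting.

```python
import re

# Simplified stand-in for ES date detection: only yyyy-MM-dd strings (assumption).
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def infer_field_type(value):
    """Approximate the field type dynamic mapping would choose for a JSON value."""
    if isinstance(value, bool):  # checked before int because bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "date" if ISO_DATE.match(value) else "text"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first non-null element.
        first = next((v for v in value if v is not None), None)
        return infer_field_type(first) if first is not None else None
    return None  # null values do not create a field


doc = {"name": "jack", "age": 18, "birthDate": "1991-10-05"}
inferred = {field: infer_field_type(value) for field, value in doc.items()}
print(inferred)
```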

Field types in a mapping

A field type is the mapping attribute that describes what kind of data a field holds. ES supports many field types, such as text, numbers, dates and booleans. Each type has its own characteristics and limits, so choosing a suitable type matters for both query performance and storage space.

Category	Subcategory	Types
Core types	String	~~string~~, text, keyword
	Integer	integer, long, short, byte
	Floating point	double, float, half_float, scaled_float
	Boolean	boolean
	Date	date
	Range	range (integer_range, long_range, date_range…)
	Binary	binary (Base64-encoded binary)
Complex types	Array	array
	Object	object
	Nested	nested
Geo types	Geo point	geo_point
	Geo shape	geo_shape
Specialized types	IP	ip
…	…	…

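String types
- text: since ES 5.x the string type is no longer supported; it is replaced by text and keyword
  - Subtypes
    - text: for fields that need full-text search
    - match_only_text: a space-optimized variant of text that disables scoring; well suited to indexing log messages
  - Common parameters of text
    - analyzer: the analyzer applied to the field at index time and at search time (search_analyzer overrides it for searches). Defaults to the index's default analyzer, or to the standard analyzer
    - search_analyzer: the analyzer applied to the field at search time; defaults to the value of analyzer
    - boost: the weight given to a match on this field at query time; a floating-point number, 1.0 by default
    - fields: lets the same string value be indexed in several ways for different purposes (multi-fields)
    - index: whether the field is searchable; true by default
    - eager_global_ordinals: load global ordinals eagerly on refresh. Worth enabling for fields that are often used in terms aggregations, at some cost to indexing speed
    - fielddata: whether the field may use in-memory fielddata for sorting, aggregations or scripting. Fielddata can consume a lot of memory, so for those use cases a keyword field is the recommended choice
    - similarity: the relevance scoring algorithm; BM25 by default
- keyword
  - Subtypes
    - keyword: for structured content; it is not analyzed and only supports exact matching. Commonly used for user IDs, email addresses and the like
    - constant_keyword: every document in the index must have the same value for this field; commonly used for version numbers
    - wildcard: mainly for unstructured machine-generated content; optimized for large values and supports wildcard-style matching. Commonly used in logging services
  - Common parameters of keyword
    - ignore_above: strings longer than this value are not indexed, although they are still kept in _source
    - Other parameters: mostly the same as for text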


# Example of the fields parameter (multi-fields)
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_smart",
      "similarity": "custom_similarity",
      "doc_values": false,
      "fields": {
        "title_smart": {
          "type": "text",
          "analyzer": "ik_smart",
          "search_analyzer": "ik_smart",
          "similarity": "custom_similarity",
          "doc_values": false
        }
      }
    }
  }
}

# Custom scoring algorithm (this fragment belongs in the index settings)
"similarity": {
  "custom_similarity": {"type": "BM25", "b": 0.9, "k1": 1.2}
}
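
To get a feel for what `k1` and `b` in `custom_similarity` control, here is a small sketch of the textbook BM25 formula for a single term. It is an illustration under stated assumptions rather than Lucene's exact implementation (which differs in constant factors), and the function name and the sample statistics are invented for the example. `k1` limits how much repeated occurrences of a term can add, and `b` controls how strongly long documents are penalized.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.9):
    """Textbook BM25 score of one term in one document (illustrative only)."""
    # Rare terms (small doc_freq) get a larger inverse document frequency.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores document length, b=1 applies it fully.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)


# Hypothetical statistics: same term frequency, different document lengths.
short_doc = bm25_term_score(tf=1, doc_len=5, avg_doc_len=10, doc_count=1000, doc_freq=10)
long_doc = bm25_term_score(tf=1, doc_len=20, avg_doc_len=10, doc_count=1000, doc_freq=10)
print(short_doc > long_doc)
```

With `b` close to 1, as in the settings above, a match in a short title outranks the same match in a long one.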

Numeric types

Type	Description
long	A signed 64-bit integer with a minimum value of -2^63 and a maximum value of 2^63-1.
integer	A signed 32-bit integer with a minimum value of -2^31 and a maximum value of 2^31-1.
short	A signed 16-bit integer with a minimum value of -32,768 and a maximum value of 32,767.
byte	A signed 8-bit integer with a minimum value of -128 and a maximum value of 127.
double	A double-precision 64-bit IEEE 754 floating point number, restricted to finite values.
float	A single-precision 32-bit IEEE 754 floating point number, restricted to finite values.
half_float	A half-precision 16-bit IEEE 754 floating point number, restricted to finite values.
scaled_float	A floating point number that is backed by a long, scaled by a fixed double scaling factor.
unsigned_long	An unsigned 64-bit integer with a minimum value of 0 and a maximum value of 2^64-1.

Range types

The difference between integer_range and integer: use integer when the field holds a single integer value, and integer_range when it holds a range of integers.

Range Type	Description
integer_range	A range of signed 32-bit integers with a minimum value of -2^31 and maximum of 2^31-1.
float_range	A range of single-precision 32-bit IEEE 754 floating point values.
long_range	A range of signed 64-bit integers with a minimum value of -2^63 and maximum of 2^63-1.
double_range	A range of double-precision 64-bit IEEE 754 floating point values.
date_range	A range of date values. Date ranges support various date formats through the format mapping parameter. Regardless of the format used, date values are parsed into an unsigned 64-bit integer representing milliseconds since the Unix epoch in UTC. Values containing the now date math expression are not supported.
ip_range	A range of ip values supporting either IPv4 or IPv6 (or mixed) addresses.

# Storing an integer
{
  "name": "John",
  "age": 25
}

# Storing an integer_range
{
  "name": "John",
  "age": {"gte": 20, "lte": 30}
}

# If my_field is mapped as keyword, a range query compares strings rather than numbers
GET /keyword_test/_search
{
  "query": {"range": {"my_field": {"gte": 21, "lt": 32}}}
}

# Response: the document with "3" is returned, because "21" <= "3" < "32" in string order
"hits" : [
  {"_index" : "keyword_test", "_type" : "_doc", "_id" : "1", "_score" : 1.0, "_source" : {"my_field" : "3"}}
]
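
The surprising hit is easy to reproduce outside ES. The sketch below contrasts numeric and string comparison for the same bounds; both helper names are invented for the illustration, and Python's string ordering is used as a stand-in for the term ordering of a keyword field.

```python
def in_range_numeric(value, gte, lt):
    """Range check as a numeric field would evaluate it."""
    return gte <= value < lt


def in_range_keyword(value, gte, lt):
    """Range check as a keyword field would evaluate it: plain string comparison."""
    return str(gte) <= value < str(lt)


print(in_range_numeric(3, 21, 32))    # the number 3 is outside [21, 32)
print(in_range_keyword("3", 21, 32))  # the string "3" sorts between "21" and "32"
```

Any value that will be queried by range is therefore better mapped as a numeric type, even if it arrives as a string.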

Object type
- An object field holds a JSON object. JSON is hierarchical, so a document can contain inner objects nested several levels deep. Lucene, however, has no notion of inner objects, so ES flattens the JSON document

# Index a document like this
PUT my-index-000001/_doc/1
{
  "region": "US",
  "manager": {
    "age": 30,
    "name": {"first": "John", "last": "Smith"}
  }
}

# Internally the document is stored as flat key-value pairs
{
  "region": "US",
  "manager.age": 30,
  "manager.name.first": "John",
  "manager.name.last": "Smith"
}

# An explicit mapping for this document (dynamic mapping would infer text for region and long for age)
{
  "mappings": {
    "properties": {
      "region": {"type": "keyword"},
      "manager": {
        "properties": {
          "age": {"type": "integer"},
          "name": {
            "properties": {
              "first": {"type": "text"},
              "last": {"type": "text"}
            }
          }
        }
      }
    }
  }
}

# At search time, inner fields are addressed by their full path
{
  "query": {"bool": {"must": [{"match": {"manager.name.first": "John"}}]}}
}
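
The flattening step can be sketched as a short recursive function. This is a conceptual model of the key-value view described above, not the code ES runs; `flatten` is a name chosen for the example.

```python
def flatten(obj, prefix=""):
    """Turn nested dicts into a single dict keyed by dotted field paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse into inner objects
        else:
            flat[path] = value
    return flat


doc = {
    "region": "US",
    "manager": {"age": 30, "name": {"first": "John", "last": "Smith"}},
}
print(flatten(doc))
```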

Nested type
- nested is a specialized version of the object type. It lets an array of objects be indexed in such a way that each object can be queried independently of the others


# ES has no notion of inner objects, so with the default object mapping the document below is flattened
{
  "group" : "fans",
  "user" : [
    {"first" : "John", "last" : "Smith"},
    {"first" : "Alice", "last" : "White"}
  ]
}

# It is stored roughly like this
{
  "group" : "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" : [ "smith", "white" ]
}

# This query wrongly matches the document, because the link between "Alice" and "White" has been lost
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last": "Smith" }}
      ]
    }
  }
}

# Map the field as nested instead
{
  "mappings": {"properties": {"user": {"type": "nested"}}}
}

# Index the same document
{
  "group" : "fans",
  "user" : [
    {"first" : "John", "last" : "Smith"},
    {"first" : "Alice", "last" : "White"}
  ]
}

# Search with a nested query: no hits, as expected, because no single user is "Alice Smith"
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last": "Smith" }}
          ]
        }
      }
    }
  }
}

# Internally, each nested object is indexed as a separate hidden document that carries the same id as its parent document
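
The difference between the two mappings comes down to where the AND is evaluated: across the whole array, or inside one object at a time. The following sketch models just that, using invented helper names and simple lowercasing in place of real analysis.

```python
users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]


def flattened_match(users, first, last):
    """Object mapping: each field is one bag of values, so pairs are mixed up."""
    firsts = {u["first"].lower() for u in users}
    lasts = {u["last"].lower() for u in users}
    return first.lower() in firsts and last.lower() in lasts


def nested_match(users, first, last):
    """Nested mapping: both conditions must hold within the same object."""
    return any(
        u["first"].lower() == first.lower() and u["last"].lower() == last.lower()
        for u in users
    )


print(flattened_match(users, "Alice", "Smith"))  # cross-object false positive
print(nested_match(users, "Alice", "Smith"))     # no single user matches both
```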

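Vector type
- dense_vector supports vectors of at most 2048 dimensions. In the ES versions described here, a dense_vector field cannot be used in ordinary queries, sorting or aggregations; it can only be consumed by functions called from scripts (ES 8.x later added kNN search on indexed vectors)

# Define the vector field and its number of dimensions
{
  "mappings": {"properties": {"my_vector": {"type": "dense_vector", "dims": 768}}}
}

# Score documents with a similarity function; query_embedding is a placeholder for the 768-dimensional query vector
{
  "query": {
    "bool": {
      "must": {
        "function_score": {
          "functions": [
            {
              "script_score": {
                "script": {
                  "source": "cosineSimilarity(params.queryVector, 'my_vector') + 2.0",
                  "params": {"queryVector": query_embedding}
                }
              }
            }
          ]
        }
      }
    }
  }
}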
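
What the script computes can be checked by hand with a few lines of Python. This is a plain re-implementation of cosine similarity for illustration, not the ES script function itself, and the two-dimensional vectors are toy values. Cosine similarity lies in [-1, 1] and ES does not accept negative scores from scripts, which is presumably why the script adds a constant.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def script_score(query_vector, doc_vector):
    """Mirror of the script above: shift the similarity so it is never negative."""
    return cosine_similarity(query_vector, doc_vector) + 2.0


print(script_score([1.0, 0.0], [1.0, 0.0]))   # identical direction
print(script_score([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```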

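Querying

Analyzers
- A string to be analyzed is passed through an analyzer, which turns it into a sequence of terms used for indexing and searching
- An analyzer and a tokenizer are not the same thing: the tokenizer is only one part of the analyzer
- analyzer = [char_filter] + tokenizer + [token filter]
  - char filter: preprocesses the input characters, for example by stripping HTML tags; see the built-in character filters in ES
  - tokenizer: splits the text into tokens, for example on whitespace (whitespace) or with the standard tokenizer (standard); see the built-in tokenizers in ES. The built-in analyzers of the same names behave as follows
    - standard: the default analyzer in Elasticsearch. It splits text into terms at spaces and punctuation, drops the punctuation and lowercases the terms
    - simple: splits text into terms at every non-letter character, which drops digits and punctuation, and lowercases the terms
    - whitespace: splits text into terms at whitespace characters, with no filtering and no lowercasing
    - keyword: does not split the text at all; commonly used for keyword fields or exact-match fields
  - filter (token filter): filters and transforms the tokens, for example lowercasing them or removing stop words. The tokens that come out of the filters are the terms; see the built-in token filters in ES

POST _analyze
{
  "analyzer": "simple",
  "text": ["HI 111 , 哈哈"]
}

# Analysis result
{
  "tokens" : [
    {"token" : "hi", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0},
    {"token" : "哈哈", "start_offset" : 9, "end_offset" : 11, "type" : "word", "position" : 1}
  ]
}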
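
The tokenizer-then-filter pipeline can be imitated in Python to see where the two tokens above come from. This is a rough model of the simple analyzer, written for illustration: the function names are invented, the regular expression `[^\W\d_]+` is an approximation of "runs of letters", and offsets are counted in Python characters.

```python
import re


def letter_tokenizer(text):
    """Tokenizer stage: emit runs of letters with their character offsets."""
    for match in re.finditer(r"[^\W\d_]+", text):
        yield match.group(), match.start(), match.end()


def lowercase_filter(tokens):
    """Token filter stage: lowercase every token, keep offsets unchanged."""
    for token, start, end in tokens:
        yield token.lower(), start, end


def simple_analyze(text):
    """Chain tokenizer and token filter, numbering the surviving tokens."""
    return [
        {"token": token, "start_offset": start, "end_offset": end, "position": position}
        for position, (token, start, end) in enumerate(lowercase_filter(letter_tokenizer(text)))
    ]


tokens = simple_analyze("HI 111 , 哈哈")
print(tokens)
```

The digits and the comma never reach the index because the tokenizer stage already discards them; the filter stage only sees "HI" and "哈哈".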

match query
- Supports both full-text search and exact matching, depending on whether the target field is analyzed
- operator: or by default; set it to and to require every term of the query to match
- minimum_should_match: the minimum number (or percentage) of terms that must match for a document to count as relevant

# Full-text search
GET job_item_profile/_search
{
  "query": {"match": {"job_name": "java 工程師"}}
}

# Exact match
# For exact-value lookups, a filter clause can replace the query clause, because filter results are cached
GET job_item_profile/_search
{
  "query": {"match": {"edu_level": "5000"}}
}

# operator
GET job_item_profile/_search
{
  "query": {"match": {"job_name": {"query": "java 工程師", "operator": "and"}}}
}

# minimum_should_match
GET job_item_profile/_search
{
  "query": {"match": {"job_name": {"query": "java 工程師", "minimum_should_match": "2"}}}
}
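
The effect of `operator` and `minimum_should_match` can be modelled on already-analyzed terms. The sketch assumes the query has been split into two terms; how a real query string is split depends on the analyzer configured for the field, and the `match` helper and sample terms are invented for the example.

```python
def match(doc_terms, query_terms, operator="or", minimum_should_match=1):
    """Decide whether a document matches, given pre-analyzed terms."""
    unique_terms = set(query_terms)
    hits = sum(1 for term in unique_terms if term in doc_terms)
    if operator == "and":
        return hits == len(unique_terms)  # every query term must be present
    return hits >= minimum_should_match   # at least N query terms must be present


query_terms = ["java", "engineer"]        # assumed analyzer output
architect = {"java", "architect"}
engineer = {"senior", "java", "engineer"}

print(match(architect, query_terms))                          # one term is enough
print(match(architect, query_terms, operator="and"))          # "engineer" is missing
print(match(engineer, query_terms, minimum_should_match=2))   # both terms present
```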

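multi_match query
- Searches several fields at once. Each field can be given its own weight, so that matches in more important fields rank higher

GET job_item_profile/_search
{
  "query": {
    "multi_match": {
      "query": "red",
      "fields": ["job_name^2.0", "company_name^1.0"]
    }
  }
}

range query
- Range operators: gt (greater than), gte (greater than or equal to), lt (less than), lte (less than or equal to)

GET job_item_profile/_search
{
  "query": {"range": {"salary_max": {"gt": 4, "lt": 8}}}
}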

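term query
- A term query looks up the exact term in the inverted index. The query text is not analyzed; it is matched against the index as-is. If the field type is text, a term query may therefore find nothing
- A term query does not normalize case either, whereas the values of a text field are lowercased by the analyzer at index time

terms query
- A terms query works like a term query but accepts several values. A document matches if the field contains any one of them

GET job_item_profile/_search
{
  "query": {"terms": {"edu_level": [5000, 6000]}}
}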
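
A toy inverted index shows why a term query can miss on a text field. The sketch is a simplified model with invented names: lowercasing plus whitespace splitting stands in for the analyzer of a text field, and a keyword field is modelled as indexing the whole value unchanged.

```python
from collections import defaultdict


def build_inverted_index(docs, analyze):
    """Map each term produced by `analyze` to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def text_analyze(text):
    return text.lower().split()  # stand-in for an analyzed text field


def keyword_analyze(text):
    return [text]                # keyword fields index the value as a single term


def term_query(index, term):
    return index.get(term, set())  # exact lookup, the query text is not analyzed


def terms_query(index, terms):
    return set().union(*(term_query(index, term) for term in terms))


docs = {1: "Java Engineer"}
text_index = build_inverted_index(docs, text_analyze)
keyword_index = build_inverted_index(docs, keyword_analyze)

print(term_query(text_index, "Java"))              # indexed term is "java"
print(term_query(keyword_index, "Java Engineer"))  # whole value matches exactly
```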

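match_phrase
- Phrase query. Searching for "java 專家" matches documents whose job_name contains that phrase. The query text is still analyzed, but its terms must all appear in the same order and next to each other, so a value such as "java 技術(shù)專家" is not returned

GET job_item_profile/_search
{
  "query": {"match_phrase": {"job_name": "java 專家"}}
}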
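
Phrase matching boils down to checking term positions. The sketch below works on token lists that are assumed to be the analyzer's output (the actual tokens depend on the analyzer in use), and `phrase_match` is a name chosen for the illustration.

```python
def phrase_match(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur in doc_tokens consecutively and in order."""
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


phrase = ["java", "expert"]  # assumed analyzer output for the query
print(phrase_match(["senior", "java", "expert"], phrase))      # adjacent and in order
print(phrase_match(["java", "technical", "expert"], phrase))   # a term sits in between
```

A plain match query would accept both documents, since each contains both terms.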

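Compound queries
- Compound queries are built with the bool query, which has four kinds of clauses: must, must_not, should and filter
- must: the document must match the clause
- must_not: the document must not match the clause
- should: a document that matches the clause gets a higher relevance score
- query DSL: structured queries check how well the content matches the condition. bool and match clauses used in query context compute a relevance score for every document
- filter DSL: structured filters only decide whether a document matches. term and range clauses used in filter context remove the documents that do not match and have no effect on the score; ES caches filters automatically to speed things up
- As a rule, use query clauses for full-text search and anything else that needs relevance scoring, and filter clauses for everything else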
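
Following that rule, a request body might keep the full-text part in must and push the exact-value conditions into filter. The sketch below only builds and prints the JSON body. It reuses the field names from the earlier examples (job_name, edu_level, salary_max) and assumes they live in the same index; `build_job_query` is a hypothetical helper.

```python
import json


def build_job_query(keyword, edu_levels, salary_gt, salary_lt):
    """Scored full-text clause in must, unscored exact conditions in filter."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"job_name": keyword}}],
                "filter": [
                    {"terms": {"edu_level": edu_levels}},
                    {"range": {"salary_max": {"gt": salary_gt, "lt": salary_lt}}},
                ],
            }
        }
    }


body = build_job_query("java engineer", [5000, 6000], 4, 8)
print(json.dumps(body, indent=2))
```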

Practice

How to handle list fields
- text
  - Chinese only, comma-separated: simple
  - English only, possibly with underscores, space-separated: standard
  - Digits only, space-separated: standard
- array: the general-purpose option, with type=keyword or double. Note that values must be lowercased by hand.

How to use the IK analyzer
- Many articles online recommend ik_max_word for indexing and ik_smart for searching. In practice, the output of ik_smart is not always a subset of the output of ik_max_word, so some searches find nothing. Reference: "An Analysis of How the IK Tokenizer Works: Thoughts Prompted by a Small Problem"
- One fix is shown below. The title field is indexed with ik_max_word, and a title_smart sub-field is additionally indexed with ik_smart. User searches always use ik_smart, which guarantees that a relevant query hits the index

{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_smart",
      "similarity": "custom_similarity",
      "doc_values": false,
      "fields": {
        "title_smart": {
          "type": "text",
          "analyzer": "ik_smart",
          "search_analyzer": "ik_smart",
          "similarity": "custom_similarity",
          "doc_values": false
        }
      }
    }
  }
}
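
One way to search such a mapping is to query the main field and its sub-field together. The sketch below is an assumption about how the two fields might be combined, not something the mapping prescribes: it builds a multi_match body over title and title.title_smart (the dotted path under which a multi-field is addressed), and `build_title_query` is a hypothetical helper. Since both fields declare ik_smart as their search_analyzer, the request does not need to name an analyzer.

```python
import json


def build_title_query(text):
    """Search the ik_max_word field and its ik_smart sub-field in one request."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": ["title", "title.title_smart"],
            }
        }
    }


body = build_title_query("elasticsearch")
print(json.dumps(body, indent=2))
```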
