

news 2025/7/14 7:14:05


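Contents

4. Saving the index to disk
- Why save it to disk
- How to save it
- Steps
  - 1. Preparation
  - 2. Main operations
5. Loading the data from disk into memory
Complete source of the Parser class
Complete source of the Index class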

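4. Saving the index to disk

Why save it to disk

The index lives in memory, so why save it to disk?

Because building the index is time-consuming.

So we should not build the index when the server starts, since that could slow startup down considerably.

The usual practice is to run this expensive work as a separate step.
Once that step has finished, the online server simply loads the prebuilt index.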

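How to save it

Text is essentially a string, and a string can be written straight to a file. So we need to turn the in-memory index structure into a string and then write that string to a file.

Turning a data structure into a string is called serialization.
Parsing a string with a specific structure back into structured data (classes, objects, basic data structures) is called deserialization.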
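
To make the two terms concrete, here is a minimal sketch that round-trips a small object through a string using only the JDK. The class DocLine and its tab-separated format are hypothetical and chosen purely for illustration; they are not the article's DocInfo class, and the article stores JSON rather than this hand-rolled format.

```java
// Hypothetical stand-in for a document record; not the article's DocInfo class.
class DocLine {
    final int docId;
    final String title;
    final String url;

    DocLine(int docId, String title, String url) {
        this.docId = docId;
        this.title = title;
        this.url = url;
    }

    // Serialization: structured data -> string (one tab-separated line).
    // Assumes title and url contain no tab characters.
    String serialize() {
        return docId + "\t" + title + "\t" + url;
    }

    // Deserialization: string with a known structure -> structured data.
    static DocLine deserialize(String line) {
        String[] parts = line.split("\t", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 fields, got " + parts.length);
        }
        return new DocLine(Integer.parseInt(parts[0]), parts[1], parts[2]);
    }

    public static void main(String[] args) {
        DocLine original = new DocLine(0, "ArrayList",
                "https://docs.oracle.com/javase/8/docs/api/java/util/ArrayList.html");
        String text = original.serialize();       // object -> string
        DocLine copy = DocLine.deserialize(text);  // string -> object
        System.out.println(text);
        System.out.println(copy.docId + " " + copy.title);
    }
}
```

A hand-written format like this breaks as soon as a field contains the separator, which is one reason to prefer an established format with a library behind it.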

There are many ready-made, general-purpose ways to serialize and deserialize. Here we use the JSON format, through the Jackson library.

Add the dependency from the Maven repository:

<!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind -->
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.18.2</version>
</dependency>

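Steps

1. Preparation

Create the core object that Jackson works through:

private ObjectMapper objectMapper = new ObjectMapper();

All later serialization and deserialization goes through this object.

Specify the directory where the index files are stored. The path ends with a separator because file names are appended to it directly:

private static final String INDEX_PATH =
        "/Users/yechiel/Desktop/Byte/code_world/Gitee/java_doc_searcher/";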

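2. Main operations

Use two files to store the forward index and the inverted index separately.

First check whether the index directory exists, and create it if it does not.
Then create two files in the index directory: forwardIndexFile (the forward index) and invertedIndexFile (the inverted index).
Use the writeValue method to write each index to its file.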
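
The two files are located by appending a file name to INDEX_PATH with string concatenation, which only yields the intended path when INDEX_PATH ends with a separator. As a hedged alternative, the sketch here uses Path.resolve from java.nio.file, which supplies the separator itself. The class name IndexPaths and the directory /tmp/index are made up for the example and are not part of the article's code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; the article's code concatenates strings instead.
class IndexPaths {
    // resolve() inserts the separator itself, so it does not matter
    // whether indexDir ends with "/" or not.
    static Path fileInIndexDir(String indexDir, String fileName) {
        return Paths.get(indexDir).resolve(fileName);
    }

    public static void main(String[] args) {
        // Both calls point into the same directory.
        System.out.println(fileInIndexDir("/tmp/index", "forward.txt"));
        System.out.println(fileInIndexDir("/tmp/index/", "inverted.txt"));
    }
}
```

A Path can be turned into a File with toFile() wherever an API expects one.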

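public void save() {
    // Use two files to store the forward index and the inverted index separately
    // 1. Check whether the index directory exists; create it if it does not
    File indexPathFile = new File(INDEX_PATH);
    if (!indexPathFile.exists()) {
        indexPathFile.mkdirs();
    }
    File forwardIndexFile = new File(INDEX_PATH + "forward.txt");
    File invertedIndexFile = new File(INDEX_PATH + "inverted.txt");
    try {
        // First argument: the file to write to; second argument: the object to write
        objectMapper.writeValue(forwardIndexFile, forwardIndex);
        objectMapper.writeValue(invertedIndexFile, invertedIndex);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

mkdirs() creates every missing level of a nested directory path in one call.
writeValue declares the checked exception IOException, so both calls are wrapped in a try-catch. With this method we do not have to turn the object into a string ourselves and then write that string to the file; it serializes and writes in one step, which is much more convenient.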
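
The two points above, nested directory creation and handling the checked IOException, can be tried without Jackson. The following is a small JDK-only sketch; SaveSketch, saveText and the temporary directory names are hypothetical and exist only for this demonstration.

```java
import java.io.File;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical demo class: mkdirs() plus a write that may throw IOException.
class SaveSketch {
    // Creates dir (including missing parents) and writes text into dir/fileName.
    // Returns false if the directory could not be created or the write failed.
    static boolean saveText(File dir, String fileName, String text) {
        if (!dir.exists() && !dir.mkdirs()) {
            return false;
        }
        File target = new File(dir, fileName);
        // try-with-resources closes the writer even if write() throws.
        try (Writer writer = Files.newBufferedWriter(target.toPath(), StandardCharsets.UTF_8)) {
            writer.write(text);
            return true;
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        File base = Files.createTempDirectory("index-demo").toFile();
        File nested = new File(base, "a/b/c"); // none of a, b, c exists yet
        System.out.println(saveText(nested, "forward.txt", "[]"));
        System.out.println(new File(nested, "forward.txt").exists());
    }
}
```

Returning a boolean instead of only printing the stack trace lets the caller notice that the index was not saved.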

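5. Loading the data from disk into memory

public void load() {
    System.out.println("Loading index: start");
    // 1. Set the paths to load from (the same paths the index was saved to)
    File forwardIndexFile = new File(INDEX_PATH + "forward.txt");
    File invertedIndexFile = new File(INDEX_PATH + "inverted.txt");
    try {
        // First argument: where to read from; second argument: the type to parse the data as
        forwardIndex = objectMapper.readValue(forwardIndexFile, new TypeReference<ArrayList<DocInfo>>() {});
        invertedIndex = objectMapper.readValue(invertedIndexFile, new TypeReference<HashMap<String, ArrayList<Weight>>>() {});
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println("Loading index: done");
}

readValue reads the file contents and parses them according to the type given here.
1. It sees that the type is ArrayList<>, so it expects the JSON in the file to be an array, written in square brackets
2. It then sees that each element is a DocInfo, so it expects every element to be an object in curly braces whose fields correspond to the fields of DocInfo
  - We can guarantee this correspondence, because the data was written to disk by objectMapper's writeValue(), which generated the JSON from those same objects
  - When the JSON is generated, each property name is used as a key; when it is parsed, each key in the JSON is used to find the matching property of the object, and the value is assigned to it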
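
The key-to-property matching described above can be illustrated with plain reflection. This is only a sketch of the idea, not how Jackson is implemented: its real data binding is considerably more elaborate. PageInfo and FieldBinder are hypothetical names, and the sketch handles only int and String fields.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class used only for this illustration.
class PageInfo {
    public int docId;
    public String title;
}

class FieldBinder {
    // For every key, look up the field with the same name and assign the value.
    static <T> T bind(Class<T> type, Map<String, String> keyValues) {
        try {
            Constructor<T> constructor = type.getDeclaredConstructor();
            constructor.setAccessible(true);
            T target = constructor.newInstance();
            for (Map.Entry<String, String> e : keyValues.entrySet()) {
                // Throws NoSuchFieldException when a key has no matching field.
                Field field = type.getDeclaredField(e.getKey());
                field.setAccessible(true);
                if (field.getType() == int.class) {
                    field.setInt(target, Integer.parseInt(e.getValue()));
                } else {
                    field.set(target, e.getValue());
                }
            }
            return target;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalArgumentException("cannot bind " + keyValues + " to " + type.getName(), ex);
        }
    }

    public static void main(String[] args) {
        // Stands in for the key/value pairs of one parsed JSON object.
        Map<String, String> parsed = new HashMap<>();
        parsed.put("docId", "3");
        parsed.put("title", "HashMap");
        PageInfo info = bind(PageInfo.class, parsed);
        System.out.println(info.docId + " " + info.title);
    }
}
```

The sketch also shows why the names must line up: a key without a matching field ends in an exception rather than a silently empty object.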

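Here we need to convert a string with this structure into an object of type ArrayList<DocInfo>. Jackson provides a helper class for exactly this purpose: TypeReference<>.

It is a class with a generic type parameter, and we use that type parameter to specify the type we actually want to convert to.

forwardIndex = objectMapper.readValue(
        forwardIndexFile, new TypeReference<ArrayList<DocInfo>>() {});

The new ... {} part creates an instance of an anonymous inner class:
- An anonymous inner class is declared that extends TypeReference (which is an abstract class)
- At the same time, one instance of that anonymous class is created
- The main purpose of this instance is to hand the type information ArrayList<DocInfo> to the readValue method

In Java, a generic type cannot be passed directly as a method argument: there is no ArrayList<DocInfo>.class, and a Class object only describes the raw ArrayList. An actual object has to be passed instead. Because of this language restriction we have to take a detour and use a dedicated generic class, together with its type parameter, to carry the information.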
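
The detour can be reproduced in a few lines of plain Java. The sketch defines its own miniature class, here called TypeToken (a hypothetical name, not Jackson's class), and assumes that TypeReference works along similar lines: an anonymous subclass keeps its generic superclass, including the type argument, available through reflection.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;

// Hypothetical miniature of the idea behind TypeReference; not Jackson's class.
abstract class TypeToken<T> {
    private final Type type;

    protected TypeToken() {
        // getClass() is the anonymous subclass; its generic superclass is
        // TypeToken<...> with the actual type argument still recorded.
        Type superType = getClass().getGenericSuperclass();
        if (!(superType instanceof ParameterizedType)) {
            throw new IllegalStateException("TypeToken must be created with a type argument");
        }
        this.type = ((ParameterizedType) superType).getActualTypeArguments()[0];
    }

    Type getType() {
        return type;
    }
}

class TypeTokenDemo {
    public static void main(String[] args) {
        // The trailing {} is what creates the anonymous subclass.
        Type captured = new TypeToken<ArrayList<String>>() {}.getType();
        System.out.println(captured.getTypeName());
    }
}
```

Because the class is abstract, the trailing {} cannot be forgotten: without it the expression does not compile, which guarantees that a subclass carrying the type argument always exists.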

Complete source of the Parser class

package com.glg.javadoc_searcher;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class Parser {
    // Root directory of the documents to load
    private static final String INPUT_PATH = "/Users/yechiel/Desktop/Byte/code_world/docs";

    // The Index instance that parsed documents are added to
    private Index index = new Index();

    // Entry point of the whole Parser
    public void run() {
        // 1. Enumerate all HTML files under INPUT_PATH, including those in every subdirectory
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH, fileList);
        /*
        for (File file : fileList) {
            System.out.println(file);
        }
        System.out.println(fileList.size());
        */
        // 2. Open each listed file, read and parse its content, and build the index
        for (File f : fileList) {
            // Parse a single HTML file
            System.out.println("Start parsing: " + f.getAbsolutePath());
            parseHTML(f);
        }
        // 3. Save the index structures built in memory to the index files
        index.save();
    }

    private void parseHTML(File f) {
        // 1. Extract the title
        String title = parseTitle(f);
        // 2. Work out the URL
        String url = parseUrl(f);
        // 3. Extract the body text (the description shown later is taken from it)
        String content = parseContent(f);
        // 4. Add the parsed information to the index
        index.addDoc(title, url, content);
    }

    // The title is the file name without its ".html" suffix
    private String parseTitle(File f) {
        String name = f.getName();
        return name.substring(0, name.length() - ".html".length());
    }

    // The URL is the online documentation prefix plus the path relative to INPUT_PATH
    private String parseUrl(File f) {
        String part1 = "https://docs.oracle.com/javase/8/docs/";
        String part2 = f.getAbsolutePath().substring(INPUT_PATH.length());
        return part1 + part2;
    }

    // Extracts the body text of the HTML file
    public String parseContent(File f) {
        // Read one character at a time; '<' and '>' switch copying off and on
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader even if read() throws
        try (FileReader fileReader = new FileReader(f)) {
            // Whether characters are currently being copied
            boolean isCopy = true;
            while (true) {
                // read() returns an int rather than a char so that it can
                // signal the end of the file with -1, which is not a valid character
                int ret = fileReader.read();
                if (ret == -1) {
                    // End of file
                    break;
                }
                // Anything other than -1 is a valid character
                char c = (char) ret;
                if (isCopy) {
                    // Copying is on: ordinary characters go into the StringBuilder
                    if (c == '<') {
                        // A tag starts: switch copying off
                        isCopy = false;
                        continue;
                    }
                    if (c == '\n' || c == '\r') {
                        // Replace line feeds and carriage returns with spaces
                        c = ' ';
                    }
                    content.append(c);
                } else {
                    // Copying is off: skip everything until '>'
                    if (c == '>') {
                        isCopy = true;
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return content.toString();
    }

    // inputPath: the directory to start the recursive traversal from
    // fileList: collects the files found by the recursion
    private void enumFile(String inputPath, ArrayList<File> fileList) {
        File rootPath = new File(inputPath);
        // listFiles returns the files and directories directly inside rootPath
        // (one level only), or null if rootPath is not a readable directory
        File[] files = rootPath.listFiles();
        if (files == null) {
            return;
        }
        for (File f : files) {
            // A directory is traversed recursively; a regular file is added
            // to fileList if it is an HTML file
            if (f.isDirectory()) {
                enumFile(f.getAbsolutePath(), fileList);
            } else if (f.getAbsolutePath().endsWith(".html")) {
                fileList.add(f);
            }
        }
    }

    public static void main(String[] args) {
        // Running main performs the whole index-building process
        Parser parser = new Parser();
        parser.run();
    }
}
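
The '<' / '>' switch used by parseContent is easier to experiment with on an in-memory string than on files. The sketch here repeats the same logic in a hypothetical helper, TagStripper.stripTags, which is not part of the article's project.

```java
// Hypothetical helper that applies the same '<' / '>' switch to a String.
class TagStripper {
    static String stripTags(String html) {
        StringBuilder content = new StringBuilder();
        boolean isCopy = true; // true while we are outside a tag
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (isCopy) {
                if (c == '<') { // a tag starts: stop copying
                    isCopy = false;
                    continue;
                }
                if (c == '\n' || c == '\r') {
                    c = ' '; // flatten line breaks into spaces
                }
                content.append(c);
            } else if (c == '>') { // the tag ends: resume copying
                isCopy = true;
            }
        }
        return content.toString();
    }

    public static void main(String[] args) {
        // prints "Hello world"
        System.out.println(stripTags("<div><h1>Hello</h1>\n<p>world</p></div>"));
    }
}
```

The switch only removes the tags themselves. Text between <script> and </script>, for example, is kept, so pages with inline scripts would need extra handling.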

Complete source of the Index class

package com.glg.javadoc_searcher;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Builds the index structures in memory
public class Index {
    private static final String INDEX_PATH = "/Users/yechiel/Desktop/Byte/code_world/Gitee/java_doc_searcher/";

    private ObjectMapper objectMapper = new ObjectMapper();

    // Forward index: the array subscript is the docId
    private ArrayList<DocInfo> forwardIndex = new ArrayList<>();

    // Inverted index as a hash table:
    // the key is a word, the value is the list of documents related to that word
    private HashMap<String, ArrayList<Weight>> invertedIndex = new HashMap<>();

    // Methods this class provides

    // 1. Given a docId, look up the document details in the forward index
    public DocInfo getDocInfo(int docId) {
        return forwardIndex.get(docId);
    }

    // 2. Given a word, look up the documents associated with it in the inverted index.
    // Returning a plain List of integers (document ids) would not be enough:
    // a word is more relevant to some documents than to others, so each entry is
    // a Weight that holds a document id together with its relevance weight.
    public List<Weight> getInverted(String term) {
        return invertedIndex.get(term);
    }

    // 3. Add a document to the index
    public void addDoc(String title, String url, String content) {
        // Both the forward index and the inverted index must be updated
        // Build the forward index entry
        DocInfo docInfo = buildForward(title, url, content);
        // Build the inverted index entries
        buildInverted(docInfo);
    }

    // Updates the inverted index for one document
    private void buildInverted(DocInfo docInfo) {
        // Local class that holds the word counts for this document
        class WordCnt {
            public int titleCount;
            public int contentCount;
        }
        // word -> number of occurrences in the title and in the body
        HashMap<String, WordCnt> wordCntHashMap = new HashMap<>();

        // 3.1 Tokenize the title
        List<Term> terms = ToAnalysis.parse(docInfo.getTitle()).getTerms();
        // 3.2 Count how often each word occurs in the title
        for (Term term : terms) {
            String word = term.getName();
            WordCnt wordCnt = wordCntHashMap.get(word);
            if (wordCnt == null) {
                // First occurrence: insert a new entry with titleCount = 1
                WordCnt newWordCnt = new WordCnt();
                newWordCnt.titleCount = 1;
                newWordCnt.contentCount = 0;
                wordCntHashMap.put(word, newWordCnt);
            } else {
                // Seen before: increment titleCount
                wordCnt.titleCount++;
            }
        }

        // 3.3 Tokenize the body text
        terms = ToAnalysis.parse(docInfo.getContent()).getTerms();
        // 3.4 Count how often each word occurs in the body
        for (Term term : terms) {
            String word = term.getName();
            WordCnt wordCnt = wordCntHashMap.get(word);
            if (wordCnt == null) {
                WordCnt newWordCnt = new WordCnt();
                newWordCnt.titleCount = 0;
                newWordCnt.contentCount = 1;
                wordCntHashMap.put(word, newWordCnt);
            } else {
                wordCnt.contentCount++;
            }
        }

        // 3.5 Both counts are now in one HashMap.
        //     The weight of the document for a word is: title count * 10 + body count
        // 3.6 Walk through the HashMap and update the inverted index.
        //     A Map is not Iterable, so iterate over its entrySet()
        for (Map.Entry<String, WordCnt> entry : wordCntHashMap.entrySet()) {
            // Look the word up in the inverted index;
            // each value of the inverted index is a posting list
            List<Weight> invertedList = invertedIndex.get(entry.getKey());
            if (invertedList == null) {
                // No posting list yet: insert a new key-value pair
                ArrayList<Weight> newInvertedList = new ArrayList<>();
                // Wrap the current document in a Weight object and add it
                Weight weight = new Weight();
                weight.setDocId(docInfo.getDocId());
                // Weight formula: title count * 10 + body count
                weight.setWeight(entry.getValue().titleCount * 10 + entry.getValue().contentCount);
                newInvertedList.add(weight);
                invertedIndex.put(entry.getKey(), newInvertedList);
            } else {
                // Posting list exists: append a Weight for the current document
                Weight weight = new Weight();
                weight.setDocId(docInfo.getDocId());
                // Weight formula: title count * 10 + body count
                weight.setWeight(entry.getValue().titleCount * 10 + entry.getValue().contentCount);
                invertedList.add(weight);
            }
        }
    }

    private DocInfo buildForward(String title, String url, String content) {
        DocInfo docInfo = new DocInfo();
        docInfo.setDocId(forwardIndex.size());
        docInfo.setTitle(title);
        docInfo.setUrl(url);
        docInfo.setContent(content);
        forwardIndex.add(docInfo);
        return docInfo;
    }

    // 4. Save the in-memory index structures to disk
    public void save() {
        long beg = System.currentTimeMillis();
        // Use two files to store the forward index and the inverted index separately
        System.out.println("Saving index: start");
        // Check whether the index directory exists; create it if it does not
        File indexPathFile = new File(INDEX_PATH);
        if (!indexPathFile.exists()) {
            indexPathFile.mkdirs();
        }
        File forwardIndexFile = new File(INDEX_PATH + "forward.txt");
        File invertedIndexFile = new File(INDEX_PATH + "inverted.txt");
        try {
            // First argument: the file to write to; second argument: the object to write
            objectMapper.writeValue(forwardIndexFile, forwardIndex);
            objectMapper.writeValue(invertedIndexFile, invertedIndex);
        } catch (IOException e) {
            e.printStackTrace();
        }
        long end = System.currentTimeMillis();
        System.out.println("Saving index: done, took " + (end - beg) + "ms");
    }

    // 5. Load the index data from disk into memory
    public void load() {
        long beg = System.currentTimeMillis();
        System.out.println("Loading index: start");
        // Set the paths to load from (the same paths the index was saved to)
        File forwardIndexFile = new File(INDEX_PATH + "forward.txt");
        File invertedIndexFile = new File(INDEX_PATH + "inverted.txt");
        try {
            // First argument: where to read from; second argument: the type to parse the data as
            forwardIndex = objectMapper.readValue(forwardIndexFile, new TypeReference<ArrayList<DocInfo>>() {});
            invertedIndex = objectMapper.readValue(invertedIndexFile, new TypeReference<HashMap<String, ArrayList<Weight>>>() {});
        } catch (IOException e) {
            e.printStackTrace();
        }
        long end = System.currentTimeMillis();
        System.out.println("Loading index: done, took " + (end - beg) + "ms");
    }
}
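
The weight formula used in buildInverted (title count * 10 + body count) can be checked in isolation. The sketch here is a JDK-only approximation: WeightSketch is a hypothetical class, and splitting on whitespace after lower-casing is an assumption that stands in for the ansj tokenizer, whose output will differ on real text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: whitespace splitting stands in for the ansj tokenizer.
class WeightSketch {
    // Counts the occurrences of each lower-cased word in text.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                // Inserts 1 for a new word, otherwise adds 1 to the existing count.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // weight = occurrences in the title * 10 + occurrences in the body
    static int weight(String word, String title, String content) {
        int titleCount = countWords(title).getOrDefault(word, 0);
        int contentCount = countWords(content).getOrDefault(word, 0);
        return titleCount * 10 + contentCount;
    }

    public static void main(String[] args) {
        // "list" occurs once in the title and twice in the body: 1 * 10 + 2 = 12
        System.out.println(weight("list", "Array List", "a list is a list of items"));
    }
}
```

Map.merge folds the "is the key already present?" check and the update into one call, which leaves no room for the kind of null handling mistake that separate get and put calls invite.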