Lazero Search Engine Update Logic
This article discusses an efficient method for updating a search engine using advanced tools such as docprompting, ColBERT, and RoBERTa. The process involves managing file lists, scanning new files based on the index, merging, saving, and removing old indexes while also handling large datasets in minibatches when necessary.
DocPrompting generates code from retrieved documentation; it uses tldr and CoNaLa to train code generation from a prompt.
ColBERT and RoBERTa handle document retrieval and embedding.
The update process must be atomic. When an update succeeds, a flag file is created under the index directory. Always check the newest index first, and clean up unusable or incompatible indexes.
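The flag check can be sketched as follows. This is a minimal illustration, not the actual Lazero code: the flag file name `READY`, the version string stored inside it, and the assumption that index directory names sort chronologically (for example, timestamps) are all hypothetical.

```python
import shutil
from pathlib import Path

FLAG_NAME = "READY"   # hypothetical flag file name
INDEX_VERSION = "1"   # hypothetical compatibility tag written into the flag


def is_usable(index_dir: Path, version: str = INDEX_VERSION) -> bool:
    """An index is usable only if its flag exists and names a compatible version."""
    flag = index_dir / FLAG_NAME
    return flag.is_file() and flag.read_text().strip() == version


def find_latest_index(root: Path, version: str = INDEX_VERSION):
    """Check the newest index first; return the first usable one, or None."""
    candidates = sorted(
        (d for d in root.iterdir() if d.is_dir()),
        key=lambda d: d.name,
        reverse=True,  # assumes names sort chronologically
    )
    for index_dir in candidates:
        if is_usable(index_dir, version):
            return index_dir
    return None


def cleanup_unusable(root: Path, version: str = INDEX_VERSION) -> list:
    """Delete index directories without a valid flag; return their names."""
    removed = []
    for index_dir in list(root.iterdir()):
        if index_dir.is_dir() and not is_usable(index_dir, version):
            shutil.rmtree(index_dir)
            removed.append(index_dir.name)
    return sorted(removed)
```

Because the flag is the only thing that marks an index as valid, a crash in the middle of an update leaves a directory without a flag, which the next run simply deletes.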
If no compatible previous index exists, build the index from the ground up, cleaning up incompatible indexes if necessary. If a compatible previous index is found, decompose it into small groups that wait to be merged and updated.
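A rough sketch of the decomposition step, assuming the index can be viewed as a mapping from file name to some per-file entry. The mapping shape, the `group_size` parameter, and both function names are illustrative assumptions.

```python
from itertools import islice


def decompose_index(index: dict, group_size: int = 1000) -> list:
    """Split an index mapping into groups of at most group_size entries."""
    items = iter(index.items())
    groups = []
    while True:
        group = dict(islice(items, group_size))
        if not group:
            break
        groups.append(group)
    return groups


def prepare_groups(previous_index, group_size: int = 1000) -> list:
    """No compatible previous index means starting from an empty list of groups."""
    if previous_index is None:
        return []
    return decompose_index(previous_index, group_size)
```

Small groups keep each later merge step bounded in size, so a single changed file does not force a rewrite of the whole index.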
First, checksum all files along with their file names. If a file is present with a matching checksum, leave it alone; otherwise remove it from the index, create a new index for it, or replace its index, as appropriate.
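The checksum comparison might look like the sketch below. SHA-256 is an arbitrary choice here, and the mapping of outcomes to actions (deleted file: remove from index; new file: create index; changed file: replace index) is one reading of the step above rather than a confirmed design.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan_checksums(root: Path) -> dict:
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_checksum(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_file_lists(old: dict, new: dict) -> dict:
    """Classify files by comparing the previous and current checksum maps."""
    return {
        "unchanged": sorted(n for n in new if old.get(n) == new[n]),  # don't touch
        "added": sorted(n for n in new if n not in old),              # create index
        "changed": sorted(n for n in new if n in old and old[n] != new[n]),  # replace
        "removed": sorted(n for n in old if n not in new),            # remove from index
    }
```

The `old` map would come from the file list saved alongside the previous index, and `new` from a fresh `scan_checksums` call over the source directory.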
Next, create or merge the file list.
Then scan the new files and update the index accordingly.
Finally, merge the index, save it to a different location, place the flag, remove the old index's flag, and then remove the old index completely. If merging is not possible for a huge data source, perform the search in minibatches.
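The final swap and the minibatch fallback can be sketched together. Everything named here is hypothetical: the `READY` flag, the `index.json` payload standing in for the real index files, and the `score` callable standing in for whatever ColBERT-style relevance scoring is actually used.

```python
import heapq
import json
import shutil
from pathlib import Path

FLAG_NAME = "READY"  # hypothetical flag file name


def publish_index(merged: dict, new_dir: Path, old_dir: Path = None) -> Path:
    """Save the merged index elsewhere, flag it, then retire the old index."""
    new_dir = Path(new_dir)
    new_dir.mkdir(parents=True, exist_ok=False)  # never overwrite in place
    (new_dir / "index.json").write_text(json.dumps(merged))
    (new_dir / FLAG_NAME).write_text("1")        # flag goes last: index is now valid
    if old_dir is not None:
        old_dir = Path(old_dir)
        (old_dir / FLAG_NAME).unlink(missing_ok=True)  # unflag before deleting
        shutil.rmtree(old_dir, ignore_errors=True)
    return new_dir


def search_in_minibatches(query, batches, score, k: int = 3) -> list:
    """Keep a running top-k of (score, doc) pairs across unmerged batches."""
    best = []
    for batch in batches:
        scored = [(score(query, doc), doc) for doc in batch]
        best = heapq.nlargest(k, best + scored)
    return best
```

With this ordering, there is always at least one flagged index on disk from the moment the first update succeeds, and only one batch of scores needs to be held in memory at a time during a minibatch search.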