# ZAP File Format

## Legend

### Sections

    |========|
    |        | section
    |========|

### Fixed-size fields

    |--------|        |----|        |--|        |-|
    |        | uint64 |    | uint32 |  | uint16 | | uint8
    |--------|        |----|        |--|        |-|

### Varints

    |~~~~~~~~|
    |        | varint (up to uint64)
    |~~~~~~~~|

### Arbitrary-length fields

    |--------...---|
    |              | arbitrary-length field (string, vellum, roaring bitmap)
    |--------...---|

### Chunked data

    [--------]
    [        ]
    [--------]

## Overview

The footer section describes the configuration of a particular ZAP file. The footer format is version-dependent, so the `V` field must be checked before the rest of the footer is parsed.

            |==================================================|
            | Stored Fields                                    |
            |==================================================|
    |-----> | Stored Fields Index                              |
    |       |==================================================|
    |       | Dictionaries + Postings + DocValues              |
    |       |==================================================|
    | |---> | DocValues Index                                  |
    | |     |==================================================|
    | |     | Fields                                           |
    | |     |==================================================|
    | | |-> | Fields Index                                     |
    | | |   |========|========|========|========|====|====|====|
    | | |   |     D# |     SF |      F |    FDV | CF |  V | CC |  (Footer)
    | | |   |========|====|===|====|===|====|===|====|====|====|
    | | |                 |        |        |
    |-+-+-----------------|        |        |
      | |--------------------------|        |
      |-------------------------------------|

     D#. Number of Docs.
     SF. Stored Fields Index Offset.
      F. Field Index Offset.
    FDV. Field DocValue Offset.
     CF. Chunk Factor.
      V. Version.
     CC. CRC32.

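As a minimal sketch of how a reader might decode the footer described above, the Go snippet below slices the trailing bytes of a file into the seven fields. The names `Footer`, `parseFooter` and `footerSize` are hypothetical and not taken from this document. Two things are assumptions: that the cell widths in the diagram follow the Legend (four `uint64` fields, then three `uint32` fields, 44 bytes in total), and that integers are stored big-endian, since this document does not state the byte order.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// footerSize assumes four uint64 fields (D#, SF, F, FDV) followed by
// three uint32 fields (CF, V, CC), as the cell widths in the diagram suggest.
const footerSize = 4*8 + 3*4

// Footer holds the footer fields in their on-disk order.
type Footer struct {
	NumDocs           uint64 // D#
	StoredIndexOffset uint64 // SF
	FieldsIndexOffset uint64 // F
	DocValueOffset    uint64 // FDV
	ChunkFactor       uint32 // CF
	Version           uint32 // V
	CRC32             uint32 // CC
}

// parseFooter decodes the trailing footerSize bytes of file.
// ASSUMPTION: integers are stored big-endian.
func parseFooter(file []byte) (Footer, error) {
	if len(file) < footerSize {
		return Footer{}, errors.New("file is shorter than the footer")
	}
	b := file[len(file)-footerSize:]
	return Footer{
		NumDocs:           binary.BigEndian.Uint64(b[0:8]),
		StoredIndexOffset: binary.BigEndian.Uint64(b[8:16]),
		FieldsIndexOffset: binary.BigEndian.Uint64(b[16:24]),
		DocValueOffset:    binary.BigEndian.Uint64(b[24:32]),
		ChunkFactor:       binary.BigEndian.Uint32(b[32:36]),
		Version:           binary.BigEndian.Uint32(b[36:40]),
		CRC32:             binary.BigEndian.Uint32(b[40:44]),
	}, nil
}

func main() {
	// Build a synthetic, footer-only file: 3 docs, version 11.
	file := make([]byte, footerSize)
	binary.BigEndian.PutUint64(file[0:8], 3)
	binary.BigEndian.PutUint32(file[36:40], 11)

	f, err := parseFooter(file)
	if err != nil {
		panic(err)
	}
	fmt.Printf("docs=%d version=%d\n", f.NumDocs, f.Version) // docs=3 version=11
}
```

A real reader would inspect `Version` before trusting the remaining fields and would probably verify `CRC32` as well; neither step is shown here.
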
## Stored Fields

The Stored Fields Index is `D#` consecutive 64-bit unsigned integers. Each one is the offset at which the corresponding Stored Fields Data record is located.

    0                                [SF]                   [SF + D# * 8]
    | Stored Fields                  | Stored Fields Index              |
    |================================|==================================|
    |                                |                                  |
    |       |--------------------|   ||--------|--------|. . .|--------||
    |   |-> | Stored Fields Data |   ||      0 |      1 |     | D# - 1 ||
    |   |   |--------------------|   ||--------|----|---|. . .|--------||
    |   |                            |              |                   |
    |===|============================|==============|===================|
        |                                           |
        |-------------------------------------------|

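The lookup implied by the diagram above can be sketched in a few lines of Go: entry `i` of the index sits at `SF + i * 8` and holds the offset of document `i`'s Stored Fields Data record. The function name `storedDataOffset` is hypothetical, and big-endian byte order is an assumption that this document does not confirm.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// storedDataOffset returns entry docNum of the Stored Fields Index, i.e. the
// offset of that document's Stored Fields Data record. sf and numDocs are the
// SF and D# footer values.
// ASSUMPTION: index entries are big-endian uint64 values.
func storedDataOffset(file []byte, sf, numDocs, docNum uint64) (uint64, error) {
	if docNum >= numDocs {
		return 0, fmt.Errorf("doc %d out of range, D# is %d", docNum, numDocs)
	}
	pos := sf + docNum*8 // each index entry is 8 bytes wide
	if pos+8 > uint64(len(file)) {
		return 0, fmt.Errorf("index entry at %d is past the end of the file", pos)
	}
	return binary.BigEndian.Uint64(file[pos : pos+8]), nil
}

func main() {
	// Synthetic layout: 8 bytes of stored data, then a 2-entry index at SF = 8.
	file := make([]byte, 8+2*8)
	binary.BigEndian.PutUint64(file[8:16], 0)  // doc 0 record at offset 0
	binary.BigEndian.PutUint64(file[16:24], 5) // doc 1 record at offset 5

	off, err := storedDataOffset(file, 8, 2, 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(off) // 5
}
```
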
Stored Fields Data is an arbitrary-size record that consists of metadata and [Snappy](https://github.com/golang/snappy)-compressed data.

    Stored Fields Data
    |~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~|
    |    MDS |    CDS |         MD        |         CD        |
    |~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~|

    MDS. Metadata size.
    CDS. Compressed data size.
     MD. Metadata.
     CD. Snappy-compressed data.

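To make the record layout concrete, here is a small Go sketch that splits one Stored Fields Data record into its `MD` and `CD` parts. The helper name `splitStoredRecord` is hypothetical. It assumes that the varints use the unsigned varint encoding implemented by Go's `encoding/binary` package, which this document does not state explicitly.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// splitStoredRecord reads the two varint sizes (MDS, CDS) at the start of a
// Stored Fields Data record and returns the metadata (MD) and the still
// compressed data (CD) as sub-slices of rec.
// ASSUMPTION: varints follow encoding/binary's unsigned varint format.
func splitStoredRecord(rec []byte) ([]byte, []byte, error) {
	mds, n := binary.Uvarint(rec)
	if n <= 0 {
		return nil, nil, errors.New("bad MDS varint")
	}
	cds, m := binary.Uvarint(rec[n:])
	if m <= 0 {
		return nil, nil, errors.New("bad CDS varint")
	}
	body := rec[n+m:]
	if mds > uint64(len(body)) || cds > uint64(len(body))-mds {
		return nil, nil, errors.New("record is truncated")
	}
	return body[:mds], body[mds : mds+cds], nil
}

func main() {
	// MDS = 2, CDS = 3, followed by 2 metadata bytes and 3 data bytes.
	rec := []byte{2, 3, 'm', 'd', 'a', 'b', 'c'}
	md, cd, err := splitStoredRecord(rec)
	if err != nil {
		panic(err)
	}
	fmt.Printf("md=%q cd=%q\n", md, cd) // md="md" cd="abc"
}
```

In a real reader the `CD` slice would then be decompressed, for example with `snappy.Decode` from the Snappy package linked above. That step is left out so the sketch depends only on the standard library.
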
## Fields

The Fields Index section is located between addresses `F` and `len(file) - len(footer)` and consists of `uint64` values (`F1`, `F2`, ...), which are offsets to records in the Fields section. There are `F# = (len(file) - len(footer) - F) / sizeof(uint64)` fields.

    (...)                            [F]                   [F + F# * 8]
    | Fields                         | Fields Index                   |
    |================================|================================|
    |                                |                                |
    |   |~~~~~~~~|~~~~~~~~|---...---|||--------|--------|...|--------||
    ||->|   Dict | Length |   Name  |||      0 |      1 |   | F# - 1 ||
    ||  |~~~~~~~~|~~~~~~~~|---...---|||--------|----|---|...|--------||
    ||                               |              |                 |
    ||===============================|==============|=================|
     |                                              |
     |----------------------------------------------|

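The Fields Index arithmetic and the `Dict | Length | Name` record can be sketched in Go as follows. The function names `fieldOffsets` and `parseField` are hypothetical. The sketch assumes big-endian index entries, Go-style unsigned varints for `Dict` and `Length`, and that `Length` is the byte length of `Name`; none of these is spelled out in this document.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// fieldOffsets reads the Fields Index, which spans from f (the F footer value)
// to the start of the footer, and returns its F# entries.
// ASSUMPTION: entries are big-endian uint64 values.
func fieldOffsets(file []byte, f uint64, footerLen int) ([]uint64, error) {
	if footerLen < 0 || footerLen > len(file) {
		return nil, errors.New("invalid footer length")
	}
	end := uint64(len(file) - footerLen)
	if f > end || (end-f)%8 != 0 {
		return nil, errors.New("invalid Fields Index bounds")
	}
	offs := make([]uint64, (end-f)/8) // F# = (len(file) - len(footer) - F) / 8
	for i := range offs {
		p := f + uint64(i)*8
		offs[i] = binary.BigEndian.Uint64(file[p : p+8])
	}
	return offs, nil
}

// parseField decodes one Fields record: Dict (varint), Length (varint), Name.
// ASSUMPTION: Length is the byte length of Name.
func parseField(file []byte, off uint64) (uint64, string, error) {
	if off > uint64(len(file)) {
		return 0, "", errors.New("field offset is past the end of the file")
	}
	rec := file[off:]
	dict, n := binary.Uvarint(rec)
	if n <= 0 {
		return 0, "", errors.New("bad Dict varint")
	}
	length, m := binary.Uvarint(rec[n:])
	if m <= 0 {
		return 0, "", errors.New("bad Length varint")
	}
	rest := rec[n+m:]
	if length > uint64(len(rest)) {
		return 0, "", errors.New("field name is truncated")
	}
	return dict, string(rest[:length]), nil
}

func main() {
	// Synthetic file: one Fields record, a one-entry index, a 4-byte stand-in footer.
	file := []byte{7, 5, 't', 'i', 't', 'l', 'e'} // Dict=7, Length=5, Name="title"
	f := uint64(len(file))
	file = append(file, 0, 0, 0, 0, 0, 0, 0, 0) // index entry: record at offset 0
	file = append(file, 0xde, 0xad, 0xbe, 0xef) // stand-in footer

	offs, err := fieldOffsets(file, f, 4)
	if err != nil {
		panic(err)
	}
	dict, name, err := parseField(file, offs[0])
	if err != nil {
		panic(err)
	}
	fmt.Println(len(offs), dict, name) // 1 7 title
}
```
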
## Dictionaries + Postings

Each field has its own dictionary, encoded in [Vellum](https://github.com/couchbase/vellum) format. A dictionary consists of `(term, offset)` pairs, where `offset` indicates the position of the postings (list of documents) for that term.

    |================================================================|- Dictionaries +
    |                                                                |  Postings +
    |                                                                |  DocValues
    |    Freq/Norm (chunked)                                         |
    |    [~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]                      |
    | |->[ Freq | Norm (float32 under varint) ]                      |
    | |  [~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]                      |
    | |                                                              |
    | |------------------------------------------------------------| |
    |    Location Details (chunked)                                | |
    |    [~~~~~~|~~~~~|~~~~~~~|~~~~~|~~~~~~|~~~~~~~~|~~~~~]        | |
    | |->[ Size | Pos | Start | End | Arr# | ArrPos | ... ]        | |
    | |  [~~~~~~|~~~~~|~~~~~~~|~~~~~|~~~~~~|~~~~~~~~|~~~~~]        | |
    | |                                                            | |
    | |----------------------|                                     | |
    |         Postings List  |                                     | |
    |         |~~~~~~~~|~~~~~|~~|~~~~~~~~|-----------...--|        | |
    |      |->|    F/N |     LD | Length | ROARING BITMAP |        | |
    |      |  |~~~~~|~~|~~~~~~~~|~~~~~~~~|-----------...--|        | |
    |      |        |----------------------------------------------| |
    |      |--------------------------------------|                  |
    |         Dictionary                          |                  |
    |         |~~~~~~~~|--------------------------|-...-|            |
    |      |->| Length | VELLUM DATA : (TERM -> OFFSET) |            |
    |      |  |~~~~~~~~|----------------------------...-|            |
    |      |                                                         |
    |======|=========================================================|- DocValues Index
    |      |                                                         |
    |======|=========================================================|- Fields
    |      |                                                         |
    | |~~~~|~~~|~~~~~~~~|---...---|                                  |
    | |   Dict | Length |   Name  |                                  |
    | |~~~~~~~~|~~~~~~~~|---...---|                                  |
    |                                                                |
    |================================================================|

## DocValues

The DocValues Index is `F#` pairs of varints, one pair per field. Each pair gives the start and end points of that field's DocValues slice.

    |================================================================|
    |     |------...--|                                              |
    |  |->| DocValues |<-|                                           |
    |  |  |------...--|  |                                           |
    |==|=================|===========================================|- DocValues Index
    ||~|~~~~~~~~~|~~~~~~~|~~|           |~~~~~~~~~~~~~~|~~~~~~~~~~~~||
    || DV1 START | DV1 STOP | . . . . . | DV(F#) START | DV(F#) END ||
    ||~~~~~~~~~~~|~~~~~~~~~~|           |~~~~~~~~~~~~~~|~~~~~~~~~~~~||
    |================================================================|

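Reading the DocValues Index amounts to decoding `F#` consecutive varint pairs. The Go sketch below illustrates this under the assumption that the varints use Go's `encoding/binary` unsigned varint encoding. The names `dvRange` and `parseDocValuesIndex` are hypothetical, and `b` stands for the bytes that start at the `FDV` offset from the footer.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// dvRange is one (start, end) pair from the DocValues Index.
type dvRange struct {
	Start, End uint64
}

// parseDocValuesIndex decodes numFields (F#) pairs of varints from b.
// ASSUMPTION: varints follow encoding/binary's unsigned varint format.
func parseDocValuesIndex(b []byte, numFields int) ([]dvRange, error) {
	if numFields < 0 {
		return nil, errors.New("negative field count")
	}
	ranges := make([]dvRange, 0, numFields)
	for i := 0; i < numFields; i++ {
		start, n := binary.Uvarint(b)
		if n <= 0 {
			return nil, fmt.Errorf("bad start varint for field %d", i)
		}
		end, m := binary.Uvarint(b[n:])
		if m <= 0 {
			return nil, fmt.Errorf("bad end varint for field %d", i)
		}
		ranges = append(ranges, dvRange{Start: start, End: end})
		b = b[n+m:] // advance to the next pair
	}
	return ranges, nil
}

func main() {
	// Two fields: (0, 10) and (10, 300); 300 takes two varint bytes (0xAC 0x02).
	b := []byte{0, 10, 10, 0xAC, 0x02}
	ranges, err := parseDocValuesIndex(b, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(ranges) // [{0 10} {10 300}]
}
```
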
DocValues are chunked, Snappy-compressed values for each document and field.

    [~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-]
    [ Doc# in Chunk | Doc1 | Offset1 | ... | DocN | OffsetN | SNAPPY COMPRESSED DATA ]
    [~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-]

The last 16 bytes describe the chunks.

    |~~~~~~~~~~~~...~|----------------|----------------|
    |    Chunk Sizes | Chunk Size Arr |         Chunk# |
    |~~~~~~~~~~~~...~|----------------|----------------|

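One possible reading of this trailer is sketched below in Go. It rests on several assumptions that this document does not confirm: that `Chunk Size Arr` is the byte length of the varint-encoded `Chunk Sizes` array that precedes it, that `Chunk#` is the number of chunks, and that both `uint64` values are big-endian. The function name `chunkSizes` is hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// chunkSizes decodes the trailer of one field's DocValues slice dv.
// ASSUMPTIONS: the last 8 bytes hold Chunk#, the 8 bytes before them hold the
// byte length of the Chunk Sizes array, and that array is a run of varints
// placed directly before the 16-byte trailer.
func chunkSizes(dv []byte) ([]uint64, error) {
	const trailer = 16
	if len(dv) < trailer {
		return nil, errors.New("DocValues slice is shorter than its trailer")
	}
	end := len(dv) - trailer
	arrLen := binary.BigEndian.Uint64(dv[end : end+8])
	numChunks := binary.BigEndian.Uint64(dv[end+8:])
	// Every varint occupies at least one byte, so Chunk# cannot exceed arrLen.
	if arrLen > uint64(end) || numChunks > arrLen {
		return nil, errors.New("inconsistent chunk trailer")
	}
	arr := dv[end-int(arrLen) : end]
	sizes := make([]uint64, 0, numChunks)
	for i := uint64(0); i < numChunks; i++ {
		v, n := binary.Uvarint(arr)
		if n <= 0 {
			return nil, errors.New("bad chunk size varint")
		}
		sizes = append(sizes, v)
		arr = arr[n:]
	}
	return sizes, nil
}

func main() {
	dv := []byte{'a', 'b', 'c'} // stand-in chunk data
	dv = append(dv, 1, 2)       // Chunk Sizes: two one-byte varints
	tail := make([]byte, 16)
	binary.BigEndian.PutUint64(tail[0:8], 2)  // Chunk Size Arr
	binary.BigEndian.PutUint64(tail[8:16], 2) // Chunk#
	dv = append(dv, tail...)

	sizes, err := chunkSizes(dv)
	if err != nil {
		panic(err)
	}
	fmt.Println(sizes) // [1 2]
}
```
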