Folder Structure

Sample Structure Index

Project

Each project runs experiments on exactly one dataset. The project folder is the working directory from which you run the experiments.

project_folders

trial

A trial is a single run of the experiment. Each trial can use different settings by supplying a different config YAML file. Multiple trial folders mean that you ran the experiment more than once. We recommend running multiple trials on the same data with different settings to find the best RAG pipeline.

Folder names are determined by the order in which the trials were run: the first trial folder is named 0, the second is named 1, and so on. You can also check this in the trial.json file.

trial_folder
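As a rough illustration of the naming rule above, the following sketch lists the trial folders of a project in trial order. It assumes only what this page states (trial folders are named 0, 1, 2, and so on, and sit directly in the project folder). The function names `list_trial_dirs` and `latest_trial_dir` are hypothetical helpers, not part of any library.

```python
from pathlib import Path


def list_trial_dirs(project_dir):
    """Return trial folders (named 0, 1, 2, ...) sorted by trial number."""
    project = Path(project_dir)
    # Keep only directories whose name is a plain number; this skips
    # siblings such as data, resources, and the trial.json file.
    trials = [p for p in project.iterdir() if p.is_dir() and p.name.isdigit()]
    # Sort numerically so that "10" comes after "9", not after "1".
    return sorted(trials, key=lambda p: int(p.name))


def latest_trial_dir(project_dir):
    """Return the most recent trial folder, or None if no trial has been run."""
    trials = list_trial_dirs(project_dir)
    return trials[-1] if trials else None
```

Sorting by the integer value rather than by the string matters once a project has ten or more trials.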

config.yaml

The YAML file you used for this trial.

Tip

You can see a sample full config.yaml.

[trial] summary.csv

The full summary of the trial, as a CSV file.

Each row records the node line, the selected module, the files and parameters that module used, and the processing time.

trail_summary
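A summary file like this can be inspected with a few lines of Python. The sketch below uses only the standard `csv` module and makes no assumption about the column names, since they are not listed on this page; it simply prints whatever columns each row has. The helper names `read_summary` and `print_summary` are hypothetical.

```python
import csv
from pathlib import Path


def read_summary(csv_path):
    """Load a summary.csv into a list of dicts, one dict per row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def print_summary(rows):
    """Print each row as 'column: value' pairs, whatever the columns are."""
    for i, row in enumerate(rows):
        print(f"--- row {i} ---")
        for column, value in row.items():
            print(f"{column}: {value}")
```

For example, `print_summary(read_summary("0/summary.csv"))` would dump the summary of the first trial, assuming the command is run from the project folder. The same helpers should work for the node line and node summary files, which are also CSV.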

pre_retrieve_node_line

node_line_folder

[Node Line] summary.csv

node_line_summary

Contains the best module and settings selected for each node. Each row shows the node, the selected module, the files and parameters it used, and the processing time.

See also

Need to know what to do with Node Line? Check out Roadmap to Modular RAG.

query_expansion

The folder is named after a node that belongs to the node line.

node_folder

Depending on the module and module params, you can run different experiments on a node. The following image shows three experiments on a node.

  • 0.parquet

  • 1.parquet

  • best_(index).parquet ⇒ Top results on a node

Tip

In the image, the first result is the best of the three experiments, so the best file name is best_0.
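The best_(index) naming convention can be read back programmatically. The sketch below scans a node folder for a file named best_(index).parquet and returns the index together with the path. It relies only on the file naming described above; `find_best_result` is a hypothetical helper, and the code does not open the parquet file itself.

```python
import re
from pathlib import Path

# Matches file names such as best_0.parquet and captures the index.
BEST_PATTERN = re.compile(r"^best_(\d+)\.parquet$")


def find_best_result(node_dir):
    """Return (index, path) for the best_(index).parquet file, or None."""
    for path in Path(node_dir).iterdir():
        match = BEST_PATTERN.match(path.name)
        if match:
            return int(match.group(1)), path
    return None
```

The returned index tells you which of the numbered result files (0.parquet, 1.parquet, and so on) was selected as the best experiment on that node.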

[Node] summary.csv

Results for each node. All attempts and evaluation metric results are recorded.

node_summary

retrieve_node_line

Attention

All other node lines and nodes follow the same layout as above, so they are not described individually here.

data

data_folder

  • corpus.parquet ⇒ corpus dataset

  • qa.parquet ⇒ qa dataset

    Tip

    The QA data can be kept as a single qa.parquet file, but splitting it into train and test sets is recommended for more accurate optimization. Check out here for how to build a qa dataset and corpus dataset.

resources

resources_folder

  • bm25.pkl: created when using bm25

  • chroma: created when using vectordb

    • collection_name = the name of the embedding model

trial.json

It contains information about each trial.

trial_json