autorag package

Subpackages

Submodules

autorag.chunker module

class autorag.chunker.Chunker(raw_df: DataFrame, project_dir: str | None = None)[source]

Bases: object

classmethod from_parquet(parsed_data_path: str, project_dir: str | None = None) Chunker[source]
start_chunking(yaml_path: str)[source]

autorag.cli module

autorag.dashboard module

autorag.dashboard.find_node_dir(trial_dir: str) List[str][source]
autorag.dashboard.get_metric_values(node_summary_df: DataFrame) Dict[source]
autorag.dashboard.make_trial_summary_md(trial_dir)[source]
autorag.dashboard.node_view(node_dir: str)[source]
autorag.dashboard.run(trial_dir: str)[source]
autorag.dashboard.yaml_to_markdown(yaml_filepath)[source]

autorag.deploy module

class autorag.deploy.Runner(config: Dict, project_dir: str | None = None)[source]

Bases: object

classmethod from_trial_folder(trial_path: str)[source]

Load a Runner from an evaluated trial folder. The trial must already have been evaluated with the Evaluator class. The project_dir is set to the parent directory of the trial folder.

Parameters:

trial_path – The path of the trial folder.

Returns:

Initialized Runner.

classmethod from_yaml(yaml_path: str, project_dir: str | None = None)[source]

Load a Runner from a YAML file. The YAML file must have been extracted from an evaluated trial using the extract_best_config method.

Parameters:
  • yaml_path – The path of the yaml file.

  • project_dir – The path of the project directory. Default is the current directory.

Returns:

Initialized Runner.

run(query: str, result_column: str = 'generated_texts')[source]

Run the pipeline with a query. The loaded pipeline must start from a single query, so its first module must be a query_expansion or retrieval module.

Parameters:
  • query – The query of the user.

  • result_column – The result column name for the answer. Default is generated_texts, which is the output of the generation module.

Returns:

The result of the pipeline.

run_api_server(host: str = '0.0.0.0', port: int = 8000, **kwargs)[source]

Run the pipeline as an API server. You can send a POST request to http://host:port/run with a JSON body like the one below:

{
    "query": "your query",
    "result_column": "generated_texts"
}

It returns a JSON response like the one below:

{
    "answer": "your answer"
}
Parameters:
  • host – The host of the api server.

  • port – The port of the api server.

  • kwargs – Other arguments for Flask app.run.
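A minimal client sketch for the endpoint documented above, using only the standard library. The host, port, and query text are assumptions (the defaults from the signature):

```python
import json
from urllib.request import Request, urlopen

# Build the POST request body documented above.
payload = {"query": "your query", "result_column": "generated_texts"}
req = Request(
    "http://localhost:8000/run",  # assumes the default host/port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a server started via runner.run_api_server(), read the answer:
# answer = json.loads(urlopen(req).read())["answer"]
```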

run_web(server_name: str = '0.0.0.0', server_port: int = 7680, share: bool = False, **kwargs)[source]

Run a web interface to interact with the pipeline. You can access it at http://server_name:server_port in your browser.

Parameters:
  • server_name – The host of the web. Default is 0.0.0.0.

  • server_port – The port of the web. Default is 7680.

  • share – Whether to create a publicly shareable link. Default is False.

  • kwargs – Other arguments for gr.ChatInterface.launch.

class autorag.deploy.RunnerInput(*, query: str, result_column: str = 'generated_texts')[source]

Bases: BaseModel

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'query': FieldInfo(annotation=str, required=True), 'result_column': FieldInfo(annotation=str, required=False, default='generated_texts')}

Metadata about the fields defined on the model, mapping field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

query: str
result_column: str
autorag.deploy.extract_best_config(trial_path: str, output_path: str | None = None) Dict[source]

Extract the optimal pipeline from evaluated trial.

Parameters:
  • trial_path – The path to the trial directory that you want to extract the pipeline from. Must already be evaluated.

  • output_path – Output path that pipeline yaml file will be saved. Must be .yaml or .yml file. If None, it does not save yaml file and just return dict values. Default is None.

Returns:

The dictionary of the extracted pipeline.

autorag.deploy.extract_node_line_names(config_dict: Dict) List[str][source]

Extract node line names in the order given by the config dictionary.

Parameters:

config_dict – The YAML configuration dict for the pipeline. You can load it from trial_folder/config.yaml.

Returns:

The list of node line names. It is the order of the node line names in the pipeline.
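A simplified sketch of the documented behavior, assuming the config layout used by AutoRAG YAML files (a top-level node_lines list whose entries carry a node_line_name key); this is not the library's implementation:

```python
from typing import Dict, List

def extract_node_line_names_sketch(config_dict: Dict) -> List[str]:
    """Return node line names in the order they appear in the config."""
    return [line["node_line_name"] for line in config_dict.get("node_lines", [])]

# Hypothetical config dict, as if loaded from trial_folder/config.yaml.
config = {
    "node_lines": [
        {"node_line_name": "retrieve_node_line", "nodes": []},
        {"node_line_name": "generate_node_line", "nodes": []},
    ]
}
print(extract_node_line_names_sketch(config))
# → ['retrieve_node_line', 'generate_node_line']
```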

autorag.deploy.extract_node_strategy(config_dict: Dict) Dict[source]

Extract node strategies from the given config dictionary. The return value is a dictionary mapping each node type to its strategy.

Parameters:

config_dict – The YAML configuration dict for the pipeline. You can load it from trial_folder/config.yaml.

Returns:

A dictionary whose keys are node types (node_type) and whose values are strategy dicts.

autorag.deploy.summary_df_to_yaml(summary_df: DataFrame, config_dict: Dict) Dict[source]

Convert a trial summary dataframe to a config YAML dictionary.

Parameters:
  • summary_df – The trial summary dataframe of the evaluated trial.

  • config_dict – The YAML configuration dict for the pipeline. You can load it from trial_folder/config.yaml.

Returns:

Dictionary representing the config YAML file. You can save this dictionary to a YAML file.

autorag.evaluator module

class autorag.evaluator.Evaluator(qa_data_path: str, corpus_data_path: str, project_dir: str | None = None)[source]

Bases: object

restart_trial(trial_path: str)[source]
start_trial(yaml_path: str)[source]

autorag.node_line module

autorag.node_line.make_node_lines(node_line_dict: Dict) List[Node][source]

Make a list of nodes from a node line dictionary.

Parameters:

node_line_dict – The node line dict, loaded from a YAML file or taken from user input.

Returns:

The list of Nodes inside this node line.

autorag.node_line.run_node_line(nodes: List[Node], node_line_dir: str, previous_result: DataFrame | None = None)[source]

Run the whole node line by running each node.

Parameters:
  • nodes – A list of nodes.

  • node_line_dir – This node line’s directory.

  • previous_result – The result of the previous node line. If None, the QA data is loaded from data/qa.parquet.

Returns:

The final result of the node line.

autorag.parser module

autorag.strategy module

autorag.strategy.avoid_empty_result(return_index: List[int])[source]

Decorator that avoids empty results from the wrapped function. When the function returns an empty result or None, the original inputs are returned instead. When the return value is a tuple, each value or list in it is checked; if empty, the original inputs are returned. The parameters at return_index of the function are kept as the original results.

Parameters:

return_index – The indices of the function parameters to return when the function produces no result.

Returns:

The original results or the results from the function.
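The behavior described above can be sketched as a small self-contained decorator. This is a simplified illustration of the documented semantics, not the library's implementation; the names are hypothetical:

```python
import functools
from typing import Any, Callable, List

def avoid_empty_result_sketch(return_index: List[int]) -> Callable:
    """Fall back to the original arguments (at return_index) when the
    wrapped function returns None or an empty result."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            origin = tuple(args[i] for i in return_index)
            result = func(*args, **kwargs)
            if result is None:
                return origin
            values = result if isinstance(result, tuple) else (result,)
            if any(hasattr(v, "__len__") and len(v) == 0 for v in values):
                return origin
            return result
        return wrapper
    return decorator

@avoid_empty_result_sketch(return_index=[0])
def drop_all(items):
    return []  # empty result triggers the fallback

@avoid_empty_result_sketch(return_index=[0])
def double(items):
    return [x * 2 for x in items]  # non-empty result passes through

print(drop_all([1, 2, 3]))  # → ([1, 2, 3],)  — original input kept
print(double([1, 2]))       # → [2, 4]
```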

autorag.strategy.filter_by_threshold(results, value, threshold, metadatas=None) Tuple[List, List][source]

Filter results by value’s threshold.

Parameters:
  • results – The result list to be filtered.

  • value – The value list to be filtered. It must have the same length as results.

  • threshold – The threshold value.

  • metadatas – The metadata of each result.

Returns:

Filtered list of results and filtered list of metadatas. A metadata list is returned even if you did not pass metadatas as input.

Return type:

Tuple[List, List]
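A self-contained sketch of the documented behavior. The comparison direction (keeping entries whose value is at or below the threshold, as with a speed threshold) is an assumption; the source only documents filtering by the value's threshold:

```python
from typing import Any, List, Optional, Tuple

def filter_by_threshold_sketch(
    results: List[Any],
    value: List[float],
    threshold: float,
    metadatas: Optional[List[Any]] = None,
) -> Tuple[List[Any], List[Any]]:
    """Keep entries whose value is at or below the threshold,
    always returning a metadata list alongside the results."""
    if metadatas is None:
        metadatas = [None] * len(results)
    kept = [(r, m) for r, v, m in zip(results, value, metadatas) if v <= threshold]
    return [r for r, _ in kept], [m for _, m in kept]

print(filter_by_threshold_sketch(["a", "b", "c"], [0.1, 0.9, 0.4], 0.5))
# → (['a', 'c'], [None, None])
```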

autorag.strategy.measure_speed(func, *args, **kwargs)[source]

Measure the execution speed of the given function.
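A minimal sketch of such a timing helper. The (result, elapsed_seconds) return shape is an assumption; the source only documents that execution speed is measured:

```python
import time
from typing import Any, Callable, Tuple

def measure_speed_sketch(func: Callable, *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run func with the given arguments and return its result
    together with the elapsed wall-clock time in seconds."""
    start = time.time()
    result = func(*args, **kwargs)
    return result, time.time() - start

result, elapsed = measure_speed_sketch(sum, range(1000))
print(result)  # → 499500
```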

autorag.strategy.select_best(results: List[DataFrame], columns: Iterable[str], metadatas: List[Any] | None = None, strategy_name: str = 'mean') Tuple[DataFrame, Any][source]
autorag.strategy.select_best_average(results: List[DataFrame], columns: Iterable[str], metadatas: List[Any] | None = None) Tuple[DataFrame, Any][source]

Select the best result by average value among given columns.

Parameters:
  • results – The list of results. Each result must be pd.DataFrame.

  • columns – Column names to be averaged; they serve as the criterion for selecting the best result.

  • metadatas – The metadata of each result. The metadata belonging to the best result is selected.

Returns:

The best result and the best metadata. The metadata is returned even if you did not pass the ‘metadatas’ parameter.

Return type:

Tuple[pd.DataFrame, Any]
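A self-contained sketch of the documented selection rule: average the given metric columns per DataFrame and keep the one with the highest mean. This is an illustration, not the library's implementation; the data and names are hypothetical:

```python
from typing import Any, Iterable, List, Optional, Tuple

import pandas as pd

def select_best_average_sketch(
    results: List[pd.DataFrame],
    columns: Iterable[str],
    metadatas: Optional[List[Any]] = None,
) -> Tuple[pd.DataFrame, Any]:
    """Pick the DataFrame whose mean over the metric columns is highest,
    together with its metadata."""
    if metadatas is None:
        metadatas = [None] * len(results)
    scores = [df[list(columns)].mean().mean() for df in results]
    best = max(range(len(results)), key=lambda i: scores[i])
    return results[best], metadatas[best]

a = pd.DataFrame({"precision": [0.2, 0.4], "recall": [0.3, 0.5]})
b = pd.DataFrame({"precision": [0.8, 0.6], "recall": [0.7, 0.9]})
best_df, best_meta = select_best_average_sketch([a, b], ["precision", "recall"], ["cfg_a", "cfg_b"])
print(best_meta)  # → cfg_b
```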

autorag.strategy.select_best_rr(results: List[DataFrame], columns: Iterable[str], metadatas: List[Any] | None = None) Tuple[DataFrame, Any][source]
autorag.strategy.select_normalize_mean(results: List[DataFrame], columns: Iterable[str], metadatas: List[Any] | None = None) Tuple[DataFrame, Any][source]
autorag.strategy.validate_strategy_inputs(results: List[DataFrame], columns: Iterable[str], metadatas: List[Any] | None = None)[source]

autorag.support module

autorag.support.dynamically_find_function(key: str, target_dict: Dict) Callable[source]
autorag.support.get_support_modules(module_name: str) Callable[source]
autorag.support.get_support_nodes(node_name: str) Callable[source]

autorag.validator module

class autorag.validator.Validator(qa_data_path: str, corpus_data_path: str)[source]

Bases: object

validate(yaml_path: str, qa_cnt: int = 5, random_state: int = 42)[source]

autorag.web module

autorag.web.chat_box(runner: Runner)[source]
autorag.web.get_runner(yaml_path: str | None, project_dir: str | None, trial_path: str | None)[source]
autorag.web.set_initial_state()[source]
autorag.web.set_page_config()[source]
autorag.web.set_page_header()[source]

Module contents

class autorag.LazyInit(factory, *args, **kwargs)[source]

Bases: object
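LazyInit stores a factory with its arguments so that object construction can be deferred. A plausible self-contained sketch follows; the __call__ interface and the caching of the built instance are assumptions about autorag.LazyInit, not documented behavior:

```python
from typing import Any, Callable

class LazyInitSketch:
    """Defer a factory call until the object is first requested,
    then cache the built instance (caching is an assumption)."""
    def __init__(self, factory: Callable, *args: Any, **kwargs: Any):
        self._factory = factory
        self._args = args
        self._kwargs = kwargs
        self._instance = None

    def __call__(self) -> Any:
        if self._instance is None:
            self._instance = self._factory(*self._args, **self._kwargs)
        return self._instance

calls = []
lazy = LazyInitSketch(lambda n: calls.append(n) or list(range(n)), 3)
print(lazy())  # factory runs on first access
print(lazy())  # cached instance; factory is not re-run
print(calls)   # factory ran exactly once
```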

class autorag.MockEmbeddingRandom(embed_dim: int, *, model_name: str = 'unknown', embed_batch_size: Annotated[int, Gt(gt=0), Le(le=2048)] = 10, callback_manager: CallbackManager = None, num_workers: int | None = None)[source]

Bases: MockEmbedding

Mock embedding with random vectors.

embed_dim: int
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'protected_namespaces': ('pydantic_model_',)}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'callback_manager': FieldInfo(annotation=CallbackManager, required=False, default_factory=<lambda>, exclude=True), 'embed_batch_size': FieldInfo(annotation=int, required=False, default=10, description='The batch size for embedding calls.', metadata=[Gt(gt=0), Le(le=2048)]), 'embed_dim': FieldInfo(annotation=int, required=True), 'model_name': FieldInfo(annotation=str, required=False, default='unknown', description='The name of the embedding model.'), 'num_workers': FieldInfo(annotation=Union[int, NoneType], required=False, default=None, description='The number of workers to use for async embedding calls.')}

Metadata about the fields defined on the model, mapping field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

autorag.handle_exception(exc_type, exc_value, exc_traceback)[source]
autorag.random() → x in the interval [0, 1).