API

This page documents the Python API of the tira package: I/O helpers, integrations with ir_datasets and PyTerrier, and utilities for interacting with third-party tools.

tira.io_utils module

tira.io_utils.all_environment_variables_for_github_action_or_fail(params)[source]
tira.io_utils.all_lines_to_pandas(input_file: str | Iterable[str], load_default_text: bool) → DataFrame[source]

Todo

add documentation

Todo

this function has two semantics: handling a file and handling file-contents
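
Example (a hedged sketch, since the documentation above is still a todo): each JSONL line becomes one row of the resulting DataFrame.

from tira.io_utils import all_lines_to_pandas

# From a file on disk (the todo above notes that files and file contents
# are both accepted):
df = all_lines_to_pandas("documents.jsonl", load_default_text=True)

# From in-memory file contents:
df = all_lines_to_pandas(['{"qid": "1", "docno": "d-1"}'], load_default_text=False)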

tira.io_utils.load_output_of_directory(directory: Path, evaluation: bool = False) → Dict | DataFrame[source]
tira.io_utils.parse_jsonl_line(input: str | bytearray | bytes, load_default_text: bool) → Dict[source]

Deserializes the line as JSON. Optionally strips the ‘original_query’ and ‘original_document’ fields from the resulting object and converts the qid and docno fields to strings.

Parameters:
  • input (str | bytearray | bytes) – A JSON-serialized string.

  • load_default_text (bool) – If True, the original_query and original_document fields are removed and the qid and docno values are converted to strings.

Returns:

The deserialized and (optionally) processed object.

Return type:

dict

Example:
>>> parse_jsonl_line('{}', False)
{}
>>> parse_jsonl_line('{"original_query": "xxxx"}', False)
{'original_query': 'xxxx'}
>>> parse_jsonl_line('{"original_query": "xxxx"}', True)
{}
>>> parse_jsonl_line('{"original_query": "xxxx", "qid": 42, "pi": 3.14}', False)
{'original_query': 'xxxx', 'qid': 42, 'pi': 3.14}
>>> parse_jsonl_line('{"original_query": "xxxx", "qid": 42, "pi": 3.14}', True)
{'qid': '42', 'pi': 3.14}
tira.io_utils.parse_prototext_key_values(file_name)[source]
tira.io_utils.run_cmd(cmd: List[str], ignore_failure=False)[source]
tira.io_utils.stream_all_lines(input_file: str | Iterable[bytes], load_default_text: bool) → Generator[Dict, Any, Any][source]

Todo

add documentation

Todo

this function has two semantics: handling a file and handling file-contents
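
Example (a hedged sketch mirroring all_lines_to_pandas): lazily yield one dict per JSONL line instead of materializing a DataFrame.

from tira.io_utils import stream_all_lines

for record in stream_all_lines("documents.jsonl", load_default_text=True):
    print(record)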

tira.io_utils.to_prototext(m: List[Dict[str, Any]], upper_k: str = '') → str[source]
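
Example (a hedged sketch; that the output is TIRA’s evaluation prototext format is an assumption based on the function name):

from tira.io_utils import to_prototext

# Serialize a list of measure dictionaries to a prototext string.
print(to_prototext([{"ndcg_cut_10": 0.42, "recall_100": 0.73}]))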

tira.ir_datasets_util module

class tira.ir_datasets_util.DictDocsstore(docs)[source]

Bases: object

get(item)[source]
get_many_iter(docids)[source]
class tira.ir_datasets_util.TirexQuery(query_id, text, title, query, description, narrative)[source]

Bases: NamedTuple

default_text()[source]

description: str

Alias for field number 4

narrative: str

Alias for field number 5

query: str

Alias for field number 3

query_id: str

Alias for field number 0

text: str

Alias for field number 1

title: str

Alias for field number 2

tira.ir_datasets_util.ir_dataset_from_tira_fallback_to_original_ir_datasets()[source]
tira.ir_datasets_util.register_dataset_from_re_rank_file(ir_dataset_id, df_re_rank, original_ir_datasets_id=None)[source]

Load a dynamic ir_datasets integration from a given re_rank_file. The dataset will be registered for the id ir_dataset_id. The original_ir_datasets_id is used to infer the class of documents, qrels, and queries.
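
Example (a hedged sketch; the dataset id and the column layout of the re-rank frame are hypothetical):

import pandas as pd
from tira.ir_datasets_util import register_dataset_from_re_rank_file

df_re_rank = pd.DataFrame([{"qid": "1", "query": "example", "docno": "d-1", "text": "..."}])
register_dataset_from_re_rank_file("my-rerank-dataset", df_re_rank,
                                   original_ir_datasets_id="msmarco-passage/trec-dl-2019/judged")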

tira.ir_datasets_util.static_ir_dataset(directory, existing_ir_dataset=None)[source]
tira.ir_datasets_util.translate_irds_id_to_tirex(dataset)[source]

tira.pyterrier_integration module

class tira.pyterrier_integration.PyTerrierAnceIntegration(tira_client)[source]

Bases: object

The pyterrier_ance integration to re-use cached ANCE indices. Wraps https://github.com/terrierteam/pyterrier_ance

ance_retrieval(dataset: str)[source]

Load a cached pyterrier_ance.ANCEIndexer from tira that was submitted as workshop-on-open-web-search/ows/pyterrier-anceindex.

References (for citation):

https://arxiv.org/pdf/2007.00808.pdf
https://github.com/microsoft/ANCE/

Args:

dataset (str): the dataset id, either a tira or ir_datasets id.

Returns:

pyterrier_ance.ANCERetrieval: the ANCE index.
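
Example (a hedged sketch; requires pyterrier_ance to be installed, and the dataset id is a placeholder):

from tira.rest_api_client import Client
from tira.pyterrier_integration import PyTerrierAnceIntegration

ance = PyTerrierAnceIntegration(Client()).ance_retrieval("msmarco-passage/trec-dl-2019/judged")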

class tira.pyterrier_integration.PyTerrierIntegration(tira_client)[source]

Bases: object

create_rerank_file(run_df=None, run_file=None, irds_dataset_id=None)[source]
doc_features(approach, dataset, file_selection=('/*.jsonl', '/*.jsonl.gz'))[source]
ensure_dataset_is_cached(irds_dataset_id, dataset)[source]
from_retriever_submission(approach, dataset, previous_stage=None, datasets=None)[source]
from_submission(approach, dataset=None, datasets=None)[source]
index(approach, dataset)[source]

Load a PyTerrier index from TIRA.
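
Example (a hedged sketch; the approach and dataset ids are placeholders modeled after typical TIRA ids):

from tira.rest_api_client import Client
from tira.pyterrier_integration import PyTerrierIntegration

pt_tira = PyTerrierIntegration(Client())
index = pt_tira.index("ir-benchmarks/tira-ir-starter/Index (tira-ir-starter-pyterrier)",
                      "msmarco-passage/trec-dl-2019/judged")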

query_features(approach, dataset, file_selection=('/*.jsonl', '/*.jsonl.gz'))[source]
reranker(approach, irds_id=None)[source]
retriever(approach, dataset=None)[source]
transform_documents(approach, dataset, file_selection=('/*.jsonl', '/*.jsonl.gz'), prefix='')[source]
transform_queries(approach, dataset, file_selection=('/*.jsonl', '/*.jsonl.gz'), prefix='')[source]
class tira.pyterrier_integration.PyTerrierSpladeIntegration(tira_client)[source]

Bases: object

The pyt_splade integration to re-use cached Splade indices. Wraps https://github.com/cmacdonald/pyt_splade

splade_index(dataset: str, approach: str = 'workshop-on-open-web-search/naverlabseurope/Splade (Index)')[source]

Load a cached pyt_splade index from tira that was submitted as the passed approach (default: ‘workshop-on-open-web-search/naverlabseurope/Splade (Index)’).

References (for citation):

https://github.com/naver/splade?tab=readme-ov-file#cite-scroll
ToDo: Ask Thibault what to cite.

Args:

dataset (str): the dataset id, either a tira or ir_datasets id.
approach (str, optional): the approach id, defaults to ‘workshop-on-open-web-search/naverlabseurope/Splade (Index)’.

Returns:

The PyTerrier index suitable for retrieval.
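
Example (a hedged sketch using the default approach; the dataset id is a placeholder and pyt_splade must be installed):

from tira.rest_api_client import Client
from tira.pyterrier_integration import PyTerrierSpladeIntegration

index = PyTerrierSpladeIntegration(Client()).splade_index("msmarco-passage/trec-dl-2019/judged")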

tira.pyterrier_util module

tira.third_party_integrations module

tira.third_party_integrations.ensure_pyterrier_is_loaded(boot_packages=('com.github.terrierteam:terrier-prf:-SNAPSHOT',), packages=(), patch_ir_datasets=True)[source]
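
Example (a hedged sketch of the common usage pattern): initialize PyTerrier via TIRA before the first import of pyterrier.

from tira.third_party_integrations import ensure_pyterrier_is_loaded

ensure_pyterrier_is_loaded()  # also patches ir_datasets by default
import pyterrier as pt
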
tira.third_party_integrations.extract_ast_value(v)[source]
tira.third_party_integrations.extract_previous_stages_from_docker_image(image: str, command: str | None = None)[source]
tira.third_party_integrations.extract_previous_stages_from_notebook(notebook: Path)[source]
tira.third_party_integrations.extract_to_be_executed_notebook_from_command_or_none(command: str)[source]
tira.third_party_integrations.get_input_directory_and_output_directory(default_input, default_output: str = '/tmp/')[source]
tira.third_party_integrations.get_output_directory(default_output: str = '/tmp/')[source]
tira.third_party_integrations.get_preconfigured_chatnoir_client(config_directory, features=['TARGET_URI'], num_results=10, retries=25, page_size=10)[source]
tira.third_party_integrations.is_running_as_inference_server()[source]
tira.third_party_integrations.load_ir_datasets()[source]
tira.third_party_integrations.load_rerank_data(default, load_default_text=True)[source]
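
Example (a hedged sketch; the default dataset id is a placeholder): load the re-rank input, falling back to the given default when not running inside TIRA.

from tira.third_party_integrations import load_rerank_data

re_rank_input = load_rerank_data(default="workshop-on-open-web-search/re-ranking-20231027-training")
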
tira.third_party_integrations.normalize_run(run, system_name, depth=1000)[source]
tira.third_party_integrations.parse_ast_extract_assignment(python_line: str)[source]
tira.third_party_integrations.parse_extraction_of_tira_approach(python_line: str)[source]
tira.third_party_integrations.parse_extraction_of_tira_approach_bash(bash_line: str)[source]
tira.third_party_integrations.persist_and_normalize_run(run, system_name, default_output=None, output_file=None, depth=1000)[source]
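
Example (a hedged sketch; the qid/docno/score run layout is an assumption):

import pandas as pd
from tira.third_party_integrations import persist_and_normalize_run

run = pd.DataFrame([{"qid": "1", "docno": "d-1", "score": 1.5}])
persist_and_normalize_run(run, system_name="my-system", default_output="/tmp/")
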
tira.third_party_integrations.register_rerank_data_to_ir_datasets(path_to_rerank_file, ir_dataset_id, original_ir_datasets_id=None)[source]

Load a dynamic ir_datasets integration from a given re_rank_file. The dataset will be registered for the id ir_dataset_id. The original_ir_datasets_id is used to infer the class of documents, qrels, and queries.
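
Example (a hedged sketch; the path and ids are placeholders):

from tira.third_party_integrations import register_rerank_data_to_ir_datasets

register_rerank_data_to_ir_datasets("rerank.jsonl.gz", "my-rerank-dataset",
                                    original_ir_datasets_id="msmarco-passage/trec-dl-2019/judged")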

Module contents