KItinerary

engine/extractorengine.h
/*
    SPDX-FileCopyrightText: 2017-2021 Volker Krause <vkrause@kde.org>

    SPDX-License-Identifier: LGPL-2.0-or-later
*/

#pragma once

#include "kitinerary_export.h"

#include <QString>

#include <memory>
#include <vector>

class QByteArray;
class QDateTime;
class QJsonArray;
class QVariant;

namespace KItinerary {

class AbstractExtractor;
class BarcodeDecoder;
class ExtractorDocumentNode;
class ExtractorDocumentNodeFactory;
class ExtractorEnginePrivate;
class ExtractorRepository;
class ExtractorScriptEngine;

/**
 * Semantic data extraction engine.
 *
 * This will attempt to find travel itinerary data in the given input data
 * (plain text, HTML text, PDF documents, etc.), and return the extracted
 * JSON-LD data.
 *
 * @section create_extractors Creating Extractors
 *
 * @subsection extractor_api Extractor API
 *
 * For adding custom extractors, two parts are needed:
 * - JSON meta-data describing the extractor and when to apply it, as described
 *   in the Extractor documentation.
 * - An extractor JavaScript file, compatible with QJSEngine.
 *
 * The extractor script will have access to API defined in the JsApi namespace:
 * - JsApi::JsonLd: functions for generating JSON-LD data.
 * - JsApi::Barcode: barcode decoding functions.
 * - JsApi::BitArray, JsApi::ByteArray for working with binary data.
 * - JsApi::ExtractorEngine for recursive invocation of the extractor process.
 *
 * The entry point to the script is specified in the meta-data, its argument depends
 * on the extractor type:
 * - Plain text extractors are passed a string.
 *   If the input is HTML or PDF, the string will be the text of the document stripped
 *   of all formatting etc.
 * - HTML extractors are passed an HtmlDocument instance allowing DOM-like access to
 *   the document structure.
 * - PDF extractors are passed a PdfDocument instance allowing access to textual and
 *   image content.
 * - Apple Wallet pass extractors are passed a KPkPass::BoardingPass instance.
 * - iCalendar event extractors are passed KCalendarCore::Event instances.
 * - UIC/ERA/VDV/IATA standardized ticket codes are passed as their respective types.
 * - Binary data is passed as ArrayBuffer.
 *
 * These functions should return an object or an array of objects following the JSON-LD
 * format defined on schema.org. JsApi::JsonLd provides helper functions to build such
 * objects. If @c null or an empty array is returned, the next applicable extractor is
 * run.
 *
 * Returned objects are then passed through ExtractorPostprocessor which will normalize,
 * augment and validate the data. This can greatly simplify the extraction, as for example
 * the expansion of an IATA BCBP ticket token already fills most key properties of a flight
 * reservation automatically.
 *
 * @subsection extractor_tools Development Tools
 *
 * For interactive testing during development of new extractors, it is recommended to
 * link (or copy) the JSON meta-data and JavaScript code files to the search path for
 * Extractor meta-data.
 *
 * Additionally, there's an interactive testing and inspection tool called @c kitinerary-workbench
 * (see https://invent.kde.org/pim/kitinerary-workbench).
 *
 * @subsection extractor_testing Automated Testing
 *
 * There are a few unit tests for extractors in the kitinerary repository (see autotests/extractordata),
 * however the majority of real-world test data cannot be shared this way, due to privacy
 * and copyright issues (e.g. PDFs containing copyrighted vendor logos and user credit card details).
 * Therefore there is also support for testing against external data (see extractortest.cpp).
 *
 * External test data is assumed to be in a folder named @c kitinerary-tests next to the @c kitinerary
 * source folder. The test program searches this folder recursively for folders with the following content
 * and attempts to extract data from each test file in there.
 *
 * - @c context.eml: MIME message header data specifying the context in which the test data
 *   was received. This typically only needs a @c From: and @c Date: line, but can even be
 *   entirely empty (or missing) for structured data that does not need a custom extractor.
 *   This context information is applied to all tests in this folder.
 * - @c <testname>.[txt|html|pdf|pkpass|ics|eml|mbox]: The input test data.
 * - @c <testname.extension>.json: The expected JSON-LD output. If this file does not
 *   exist, it is created by the test program.
 * - @c <testname.extension>.skip: If this file is present, the corresponding test
 *   is skipped.
 */
class KITINERARY_EXPORT ExtractorEngine
{
public:
    ExtractorEngine();
    ExtractorEngine(ExtractorEngine &&) noexcept;
    ExtractorEngine(const ExtractorEngine &) = delete;
    ~ExtractorEngine();

    /** Resets the internal state, call before processing new input data. */
    void clear();

    /** Set raw data to extract from.
     * @param data Raw data to extract from.
     * @param fileName Optional file name, used as a hint for MIME type auto-detection if needed.
     * @param mimeType MIME type of @p data, auto-detected if empty.
     */
    void setData(const QByteArray &data, QStringView fileName = {}, QStringView mimeType = {});

    /** Already decoded data to extract from.
     * @param data Has to contain an object of a supported data type matching @p mimeType.
     */
    void setContent(const QVariant &data, QStringView mimeType);

    /** Provide a document part that is only used to determine which extractor to use,
     * but not for extraction itself.
     * This can for example be the MIME message part wrapping a document to extract.
     * This is not necessary when the document part is already included in
     * what is passed to setContent().
     */
    void setContext(const QVariant &data, QStringView mimeType);

    /** Set the date the extracted document has been issued at.
     * This does not need to be perfectly accurate and is used to
     * complete incomplete date information in the document (typically
     * a missing year).
     * This method does not need to be called when setContext() is used.
     */
    void setContextDate(const QDateTime &dt);

    /** Perform extraction of "risky" content such as PDF files in a separate process.
     * This is safer as it isolates the using application from crashes/hangs due to corrupt files.
     * It is however slower, and not available on all platforms.
     * This is off by default.
     */
    void setUseSeparateProcess(bool separateProcess);

    /** Sets additional extractors to run on the given data.
     * Extractors are usually selected automatically, so this most likely does not need to
     * be called manually. This mainly exists for the external extractor process.
     */
    void setAdditionalExtractors(std::vector<const AbstractExtractor*> &&extractors);

    /** Hints about the document to extract based on application knowledge that
     * can help the extractor.
     */
    enum Hint {
        NoHint = 0,
        ExtractFullPageRasterImages = 1, ///< perform expensive image processing on (PDF) documents containing full page raster images
        ExtractGenericIcalEvents = 2, ///< generate Event objects for generic ical events.
    };
    Q_DECLARE_FLAGS(Hints, Hint)

    /** The currently set extraction hints. */
    Hints hints() const;
    /** Set extraction hints. */
    void setHints(Hints hints);

    /** Perform the actual extraction, and return the JSON-LD data
     * that has been found.
     */
    QJsonArray extract();

    /** Returns the extractor id used to obtain the result.
     * Can be empty if generic extractors have been used.
     * Not supposed to be used for normal operations, this is only needed for tooling.
     */
    QString usedCustomExtractor() const;

    /** Factory for creating new document nodes.
     * This is only for use by KItinerary::ExtractorDocumentProcessor instances.
     */
    const ExtractorDocumentNodeFactory* documentNodeFactory() const;
    /** Barcode decoder for use by KItinerary::ExtractorDocumentProcessor.
     * Use this rather than your own instance as it caches repeated attempts to
     * decode the same image.
     */
    const BarcodeDecoder* barcodeDecoder() const;

    ///@cond internal
    /** Extractor repository instance used by this engine. */
    const ExtractorRepository* extractorRepository() const;
    /** JavaScript execution engine for script extractors. */
    const ExtractorScriptEngine* scriptEngine() const;
    /** Document root node.
     * Only fully populated after extraction has been performed.
     * Only exposed for tooling.
     */
    ExtractorDocumentNode rootDocumentNode() const;
    /** Process a single node.
     * For use by the script engine, do not use manually.
     */
    void processNode(ExtractorDocumentNode &node) const;
    ///@endcond

private:
    std::unique_ptr<ExtractorEnginePrivate> d;
};

Q_DECLARE_OPERATORS_FOR_FLAGS(ExtractorEngine::Hints)

}
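
The Hint values in the header are powers of two so that several hints can be combined into one Hints value. The sketch below models that combination with a plain integer so it compiles without Qt. The names Hints (as an integer alias), testHint and withHint are hypothetical stand-ins for illustration only; the real header gets this behaviour from QFlags via Q_DECLARE_FLAGS and Q_DECLARE_OPERATORS_FOR_FLAGS.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for ExtractorEngine::Hint; the values are copied from the header.
enum Hint : std::uint32_t {
    NoHint = 0,
    ExtractFullPageRasterImages = 1,
    ExtractGenericIcalEvents = 2,
};

// Hypothetical stand-in for the QFlags-based Hints type.
using Hints = std::uint32_t;

// Returns true if the bit for `flag` is set in `hints`.
inline bool testHint(Hints hints, Hint flag)
{
    return (hints & flag) != 0;
}

// Returns a copy of `hints` with `flag` switched on or off.
inline Hints withHint(Hints hints, Hint flag, bool on = true)
{
    assert(flag != NoHint); // NoHint carries no bit that could be toggled
    const Hints bit = flag;
    return on ? (hints | bit) : (hints & ~bit);
}
```

With the real class, the equivalent call would presumably look like `engine.setHints(engine.hints() | ExtractorEngine::ExtractGenericIcalEvents);`, which keeps hints that were set earlier instead of overwriting them.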
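
The setContextDate() documentation says the context date is used to complete incomplete dates, typically a missing year. One plausible way to do that is sketched below. The header does not confirm that KItinerary works this way: the Date struct and completeYear are hypothetical, and the rule "the travel date is never before the issue date" is an assumption made for this illustration.

```cpp
#include <cassert>
#include <tuple>

// Minimal calendar date for illustration (hypothetical, not a KItinerary or Qt type).
struct Date {
    int year;
    int month;
    int day;
};

// Completes a day/month pair that lacks a year, using the date the document was issued.
// Assumption: the completed date must not lie before the issue date, so a
// day/month earlier in the calendar than the issue date rolls over to the next year.
// Leap days and other calendar validity checks are deliberately left out.
inline Date completeYear(int month, int day, const Date &issued)
{
    assert(month >= 1 && month <= 12 && day >= 1 && day <= 31);
    Date candidate{issued.year, month, day};
    if (std::tie(candidate.month, candidate.day) < std::tie(issued.month, issued.day)) {
        ++candidate.year;
    }
    return candidate;
}
```

For a booking confirmation issued on 2024-12-20 that mentions "5 January" without a year, this sketch yields 2025-01-05, while "24 December" stays in 2024. This also shows why the context date does not have to be perfectly accurate: only its rough position in the calendar matters.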
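
The external test data layout described in the "Automated Testing" section pairs each input file with a `.json` and an optional `.skip` companion whose name is the full input file name plus a suffix. The following sketch derives those names with std::filesystem. The helper names isTestInput, expectedOutput and skipMarker are hypothetical, and treating context.eml as a non-input is an assumption; how extractortest.cpp actually does this is not shown in the header.

```cpp
#include <cassert>
#include <filesystem>
#include <set>
#include <string>

namespace fs = std::filesystem;

// True if `file` has one of the input extensions listed in the documentation.
inline bool isTestInput(const fs::path &file)
{
    static const std::set<std::string> extensions{
        ".txt", ".html", ".pdf", ".pkpass", ".ics", ".eml", ".mbox"};
    // Assumption: context.eml only provides context and is not a test input itself.
    if (file.filename().string() == "context.eml") {
        return false;
    }
    return extensions.count(file.extension().string()) > 0;
}

// <testname.extension>.json: the suffix is appended to the full file name,
// it does not replace the existing extension.
inline fs::path expectedOutput(const fs::path &input)
{
    assert(!input.empty());
    fs::path result = input;
    result += ".json";
    return result;
}

// <testname.extension>.skip: the presence of this file disables the test.
inline fs::path skipMarker(const fs::path &input)
{
    assert(!input.empty());
    fs::path result = input;
    result += ".skip";
    return result;
}
```

Under these assumptions, `ticket.pdf` maps to `ticket.pdf.json` and `ticket.pdf.skip`, and the generated `.json` files are never mistaken for inputs because `.json` is not among the listed input extensions.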
This file is part of the KDE documentation.
Documentation copyright © 1996-2025 The KDE developers.
Generated on Fri Jan 24 2025 11:52:35 by doxygen 1.13.2 written by Dimitri van Heesch, © 1997-2006
