Using Amazon Kendra S3 Connector

Amazon Kendra offers a connector for Amazon S3 that ingests documents into a Kendra index using an S3 bucket as the data source.

One feature of the Amazon S3 connector is the ability to ingest attributes (metadata) associated with the original document. For example, you could define attributes such as the document title, the category, or the published date.

These attributes let you improve your search results through filtering (pre-search query refinement), boosting (adjusting search result ranking), and faceting (post-results refinement), which are reviewed in a different section. Attributes can also point to the original document location, as is the case with the attribute _source_uri.
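To make the idea of document attributes concrete, here is a minimal sketch of how a metadata file for one document could be generated. It assumes the S3 connector's JSON metadata format, in which a file named after the document plus the suffix .metadata.json (for example whitepaper.pdf.metadata.json) carries a Title and an Attributes map. The helper name build_metadata, the sample values, and the choice of mapping the published date to the reserved attribute _created_at are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def build_metadata(title, category, published, source_uri):
    """Return the JSON text of a metadata file for a single document."""
    metadata = {
        "Title": title,
        "ContentType": "PDF",
        "Attributes": {
            "_category": category,
            # Date attributes are written as ISO 8601 strings.
            "_created_at": published.isoformat(),
            # Link that search results can use to reach the original document.
            "_source_uri": source_uri,
        },
    }
    return json.dumps(metadata, indent=2)


# Hypothetical whitepaper values, used only for illustration.
print(
    build_metadata(
        title="Example Whitepaper",
        category="Best_Practices",
        published=datetime(2020, 1, 1, tzinfo=timezone.utc),
        source_uri="https://example.com/whitepapers/example.pdf",
    )
)
```

The resulting text would be saved next to the document, or under a separate metadata prefix, so the connector can pick it up during the sync.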

For this exercise, you are going to index a set of AWS whitepapers in PDF format. These documents are stored in different directories within the archive depending on their category, for example Best_Practices.

The dataset can be downloaded here: AWS_Whitepapers.zip

Decompress the archive and upload the documents to an S3 bucket so you can ingest them into Amazon Kendra and explore what kind of results you can obtain. You can upload the documents exactly as they are when uncompressed, so that the S3 bucket has the same structure as the original directory. Kendra is able to process the subfolders in S3.
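If you prefer to script the upload rather than use the console, the step above can be sketched as follows. This is one possible approach, assuming boto3 is installed and AWS credentials are configured. The function names s3_keys and upload_tree, the local folder name, and the bucket name are hypothetical.

```python
from pathlib import Path


def s3_keys(local_root):
    """Map every file under local_root to an S3 key that mirrors its relative path."""
    root = Path(local_root)
    return {
        path: path.relative_to(root).as_posix()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def upload_tree(local_root, bucket):
    """Upload the directory tree, keeping the folder structure as key prefixes."""
    # Imported here so the key mapping above can be checked without AWS access.
    import boto3

    s3 = boto3.client("s3")
    for path, key in s3_keys(local_root).items():
        s3.upload_file(str(path), bucket, key)
```

For example, upload_tree("AWS_Whitepapers", "my-kendra-workshop-bucket") would store a file found at AWS_Whitepapers/Best_Practices/paper.pdf under the key Best_Practices/paper.pdf, so each category folder becomes an S3 prefix.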

You can also define which S3 prefixes are included in or excluded from crawling in the “Additional options” section.
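The same choice can be expressed programmatically. The sketch below builds the S3 configuration structure that the Kendra API accepts when a data source is created. It assumes the S3Configuration fields InclusionPrefixes and ExclusionPatterns. Note that in this structure exclusions are expressed as glob patterns rather than plain prefixes. The helper name, the bucket name, and the example pattern are hypothetical.

```python
def s3_data_source_config(bucket, include_prefixes=(), exclude_patterns=()):
    """Build the Configuration value for an S3 data source."""
    config = {"BucketName": bucket}
    if include_prefixes:
        # Only keys that start with one of these prefixes are crawled.
        config["InclusionPrefixes"] = list(include_prefixes)
    if exclude_patterns:
        # Keys that match one of these glob patterns are skipped.
        config["ExclusionPatterns"] = list(exclude_patterns)
    return {"S3Configuration": config}


# Hypothetical values: crawl only the Best_Practices folder and skip drafts.
example = s3_data_source_config(
    "my-kendra-workshop-bucket",
    include_prefixes=["Best_Practices/"],
    exclude_patterns=["*draft*"],
)
print(example)
```

The returned dictionary would be passed as the Configuration argument of the boto3 Kendra client's create_data_source call, together with Type="S3", the index ID, and an IAM role ARN.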