Metadata Documents for the Amazon Kendra S3 Connector

Let’s pick up the scenario where we left off in the last section. Your search results may vary dramatically when you filter on an attribute. For example, compare the results you get depending on the facet you select for the search:

Using Well Architected as category:

Using Security as category:
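
The same kind of category filter can also be expressed programmatically. The following is a minimal sketch, assuming you query the index through the Kendra Query API with an attribute filter on _category; the index ID placeholder and the query text are hypothetical, and the function only builds the request parameters without calling AWS.

```python
def build_category_query(index_id, query_text, category):
    """Return keyword arguments for a Kendra Query request filtered by _category."""
    return {
        "IndexId": index_id,
        "QueryText": query_text,
        # Only return documents whose _category attribute equals `category`.
        "AttributeFilter": {
            "EqualsTo": {
                "Key": "_category",
                "Value": {"StringValue": category},
            }
        },
    }


# Hypothetical index ID and query text, for illustration only.
params = build_category_query(
    "<your-index-id>", "How do I encrypt data at rest?", "Security"
)

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("kendra").query(**params)
```

Switching the last argument to "Well Architected" would give the other filtered view described above.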

To improve your results and give your users the opportunity to filter documents by their attributes, you are going to create a separate metadata document to go along with each of your PDF documents.

You can extract metadata automatically, depending on how the PDF was created. Some PDF documents include metadata natively; typical examples are the creation date, the title, and the author. In our case, I extracted the creation date and title where available. Additionally, a new attribute called category has been created based on the directory where the document was located. The category attribute can be useful for sources like GitHub, where there is a directory structure. It also helps when you curate documents manually, because dropping a document into a directory for parsing is easier than editing metadata files by hand.
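
As a rough illustration of this approach, here is a minimal sketch that derives the category from the parent directory and assembles a metadata document. It assumes the title, creation date and page count have already been read from the PDF with a library of your choice, it mirrors the key names used in this section's example, and it assumes the metadata file is named after the document with a .metadata.json suffix. The function names and the file path are hypothetical.

```python
import json
from pathlib import Path


def build_metadata(pdf_path, title=None, created_at=None, number_of_pages=None):
    """Assemble a metadata document for one PDF.

    The category is taken from the name of the directory holding the file;
    the other values are expected to come from the PDF's own metadata.
    """
    pdf_path = Path(pdf_path)
    attributes = {"_category": pdf_path.parent.name}
    if number_of_pages is not None:
        attributes["number_of_pages"] = number_of_pages
    if created_at is not None:
        attributes["_created_at"] = created_at  # ISO 8601 string
    metadata = {"Attributes": attributes, "content_type": "PDF"}
    if title:
        metadata["title"] = title  # only set when the PDF provides a title
    return metadata


def metadata_path(pdf_path):
    """Return the metadata file path, e.g. doc.pdf -> doc.pdf.metadata.json."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(pdf_path.name + ".metadata.json")


# Hypothetical file location, for illustration only.
pdf = "whitepapers/Best Practices/s3-performance.pdf"
doc = build_metadata(
    pdf,
    title="Best Practices Design Patterns: Optimizing Amazon S3 Performance",
    created_at="2019-05-31T22:48:23Z",
    number_of_pages=12,
)
serialized = json.dumps(doc, indent=4)  # text to write to metadata_path(pdf)
```

Keeping the category tied to the directory means that moving a file to another folder and regenerating the metadata is enough to recategorize it.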

You can download the AWS Whitepapers with metadata documents here.

This is one example:

{
    "Attributes": {
        "_category": "Best Practices",
        "number_of_pages": 12,
        "_created_at": "2019-05-31T22:48:23Z"
    },
    "title": "Best Practices Design Patterns: Optimizing Amazon S3 Performance",
    "content_type": "PDF"
}

In this case, the title, the creation date (_created_at) and number_of_pages have been extracted from the document’s metadata, while the category was derived from the directory where the file is stored.

Note: As mentioned before, another important attribute is _source_uri, which can be used as the document location in the search results instead of the S3 object itself.
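
To make the note concrete, here is a minimal sketch that adds _source_uri to an existing metadata document, assuming the attribute belongs under Attributes alongside _category. The helper name and the URL are hypothetical.

```python
import copy


def with_source_uri(metadata, source_uri):
    """Return a copy of a metadata document with _source_uri set."""
    updated = copy.deepcopy(metadata)  # leave the caller's dict untouched
    updated.setdefault("Attributes", {})["_source_uri"] = source_uri
    return updated


original = {
    "Attributes": {"_category": "Best Practices"},
    "title": "Best Practices Design Patterns: Optimizing Amazon S3 Performance",
    "content_type": "PDF",
}

# Hypothetical public location of the document, for illustration only.
linked = with_source_uri(original, "https://example.com/whitepapers/s3-performance.pdf")
```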