Indexing PDF files in Azure Web Apps


Description

Prior to Sitecore XP 9.1 Initial Release, the media indexing feature was based on iFilters, which are not supported in Azure Web Apps. As a result, the following exception is thrown when indexing a media item that contains a PDF file as its blob:
ERROR Could not compute value for ComputedIndexField: _content for indexable: sitecore://master/{8EEE161B-F7D1-4339-AE77-1FA10B8CF8D2}?lang=en&ver=1
Exception: System.Runtime.InteropServices.COMException
Message: Exception from HRESULT: 0x80048605
Source: Sitecore.ContentSearch
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.IPersistStream.Load(IStream stream)
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterLoader.InitializeFilterAsPersistStream(IFilter filter, String fileName)
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterLoader.LoadAndInitIFilter(String fileName, String extension)
   at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterReader..ctor(String fileName)
   at Sitecore.ContentSearch.ComputedFields.MediaItemIFilterTextExtractor.ComputeFieldValue(IIndexable indexable)
   at Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor.ComputeFieldValue(IIndexable indexable)
   at Sitecore.ContentSearch.Azure.CloudSearchDocumentBuilder.AddComputedIndexFields()

Solution

To resolve the issue, consider one of the following options:

Note

Starting from Sitecore XP 9.1 Initial Release, the content of PDF files is extracted using the third-party PDFsharp library. The library has some limitations that may lead to a "Hexadecimal value is an invalid character" error during media indexing.
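
The wording of that error resembles the message produced when text contains characters that are not allowed in XML 1.0, such as control characters left over from PDF text extraction. Assuming that is the cause, the following minimal Python sketch shows which characters fall outside the XML 1.0 character range and how they could be located or removed from extracted text. It is an illustration only, not Sitecore or PDFsharp code, and the function names are hypothetical.

```python
import re

# Assumption: the indexing error is caused by characters outside the
# XML 1.0 "Char" production, which allows only:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (position, hex code) pairs for characters not valid in XML 1.0."""
    return [(m.start(), hex(ord(m.group()))) for m in _INVALID_XML_CHARS.finditer(text)]


def strip_invalid_xml_chars(text):
    """Return the text with all characters not valid in XML 1.0 removed."""
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # Hypothetical extracted text containing a NUL and a form feed.
    extracted = "Page 1\x00 of 2\x0c"
    print(find_invalid_xml_chars(extracted))
    print(strip_invalid_xml_chars(extracted))
```

Locating the offending characters first can help confirm whether a particular PDF triggers the problem before deciding how to handle its content.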