Error converting HTML to DOCX: unexpected Unicode characters in output

I have a project where I need to convert a PDF to HTML, make some formatting changes, and then convert the HTML to DOCX. I can't convert PDF to DOCX directly because of the intermediate formatting changes. PDF to HTML works perfectly, but I am getting unexpected Unicode characters in my documents when I convert the HTML to DOCX. Can you tell me why this is happening?

@ans11

We would appreciate it if you could share your input HTML and output DOCX as a ZIP file. It will help us understand the exact issue and address it.

demo11.zip (3.3 MB)

I uploaded a PDF, converted the first 10 pages to HTML, and then converted that HTML to DOCX. I have uploaded the HTML and DOCX for those pages. We are interested in purchasing the premium plan if this works well.

@ans11

Thanks for sharing the sample HTML file. We have converted the HTML to DOCX with both GroupDocs.Conversion Cloud and MS Word, and both produce similar issues. It appears to be a font-related issue. We would appreciate it if you could also share your input PDF document and custom fonts. We will investigate further and share our findings with you.

I have attached the PDF. Regarding the fonts, we expected the HTML to DOCX conversion to retain them, since the HTML renders beautifully in the browser.

We checked the fonts in the DOCX, which lists the base font and a subset of it, something like Nudi04e + RSTYGF. Can you let us know why the fonts are not loading?

4th-language-savikannada-1-1-51-1-10.pdf (1.7 MB)

@ans11

Thanks for sharing your input PDF document. We have logged a ticket, CONVERSIONCLOUD-396, for further investigation and rectification. We will complete the investigation as soon as possible and update you.

Hi, @ans11. Sorry for the delayed answer. This issue cannot be fixed in GroupDocs.Conversion Cloud. Here is why:

Limitation of Aspose.HTML when converting HTML to DOCX.
Aspose.HTML does not support correct embedding of legacy (non-Unicode) fonts when exporting HTML to DOCX.
In particular, it does not support fonts that:

  1. Do not contain a Unicode cmap table (platform ID 3, encoding ID 1 or 10)
  2. Rely on legacy encodings (e.g. Nudi/Baraha), where Kannada glyphs are mapped to Latin code points
  3. Render correctly in browsers and PDF output but require direct glyph-ID rendering rather than Unicode text mapping.
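To check the first criterion on your own font file, a minimal sketch along the following lines may help. It reads the table directory of a raw TrueType/OpenType (sfnt) file and lists the (platform ID, encoding ID) pairs declared in its cmap table. The function names are hypothetical and not part of any GroupDocs or Aspose API, and the sketch assumes an uncompressed .ttf/.otf file (not WOFF/WOFF2, not a .ttc collection, and not an obfuscated .odttf).

```python
import struct


def list_tables(data: bytes) -> dict:
    """Return {tag: (offset, length)} from an sfnt table directory."""
    # Offset table: sfntVersion (4 bytes), numTables (2 bytes), then 6 more bytes.
    num_tables = struct.unpack_from(">H", data, 4)[0]
    tables = {}
    for i in range(num_tables):
        # Each table record is 16 bytes: tag, checksum, offset, length.
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sLLL", data, 12 + 16 * i
        )
        tables[tag.decode("latin-1")] = (offset, length)
    return tables


def cmap_encodings(data: bytes) -> list:
    """Return the (platformID, encodingID) pairs declared in the cmap table."""
    tables = list_tables(data)
    if "cmap" not in tables:
        return []
    offset, _length = tables["cmap"]
    # cmap header: version (2 bytes), numTables (2 bytes).
    _version, count = struct.unpack_from(">HH", data, offset)
    # Each encoding record is 8 bytes: platformID, encodingID, subtable offset.
    return [
        struct.unpack_from(">HH", data, offset + 4 + 8 * i) for i in range(count)
    ]


def has_unicode_cmap(data: bytes) -> bool:
    """True if the font declares a Windows Unicode cmap (platform 3, encoding 1 or 10)."""
    return any(p == 3 and e in (1, 10) for p, e in cmap_encodings(data))
```

Typical usage would be `has_unicode_cmap(open("font.ttf", "rb").read())`, where the file name is a placeholder. A `False` result suggests the font falls under the first point above, although a font can declare a Unicode cmap and still map Kannada glyphs to Latin code points (the second point).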

During HTML → DOCX conversion, Aspose.HTML:

  • reconstructs embedded fonts into word/fonts/*.odttf
  • produces font files that are invalid for Microsoft Word (missing cmap and OS/2 tables)
  • causes Microsoft Word to ignore the embedded fonts and fall back to a default system font (typically Calibri), which results in garbled text instead of correct Kannada glyphs.
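If you want to see which font parts ended up in your own output file, here is a small sketch that lists the entries under word/fonts/ in a DOCX. It relies only on the fact that a DOCX is a ZIP archive; the function name is hypothetical. It does not validate or de-obfuscate the .odttf data, it only shows what was embedded and how large each part is.

```python
import zipfile


def embedded_font_parts(docx) -> dict:
    """Return {part name: uncompressed size} for entries under word/fonts/.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        return {
            info.filename: info.file_size
            for info in zf.infolist()
            if info.filename.startswith("word/fonts/")
        }
```

For example, `embedded_font_parts("output.docx")` (a placeholder file name) returns an empty dictionary when no fonts were embedded at all.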

In other words, Microsoft Word requires Unicode text and Unicode-compatible fonts. Legacy (non-Unicode) fonts are not supported, even if they are embedded in the document or installed on the system.
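As a rough diagnostic, you can check whether the text in your HTML is stored as Unicode Kannada at all. The sketch below measures the share of non-whitespace characters that fall in the Unicode Kannada block (U+0C80 to U+0CFF). The function name is hypothetical, and extracting the text from the HTML is left out. A share near zero for text that displays as Kannada in the browser suggests the text is stored as Latin code points and relies on a legacy font for its appearance.

```python
def kannada_share(text: str) -> float:
    """Share of non-whitespace characters inside the Kannada block (U+0C80..U+0CFF)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    in_block = sum(1 for ch in chars if 0x0C80 <= ord(ch) <= 0x0CFF)
    return in_block / len(chars)
```

This only detects the symptom. Converting legacy-encoded text to Unicode would need a mapping table specific to the font in question, which is not covered here.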