UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' (writing to PDF)

Question

I am having an issue with Unicode in the contents of a variable when writing to a .pdf with Python.

It's outputting this error:

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013'

It is basically getting caught on a dash character ('\u2013' is an en dash).

I tried taking the variable whose contents include the dash and redefining it with '.encode('utf-8')', as below:

Body = msg.Body

BodyC = Body.encode('utf-8')

And now I get the below error:

Traceback (most recent call last):
  File "script.py", line 37, in <module>
    pdf.cell(200, 10, txt="Bod: " + BodyC,  ln=4, align="C")
TypeError: can only concatenate str (not "bytes") to str

Below is my full code. How can I simply fix the Unicode error in the contents of the 'Body' variable?

I would be happy to convert to UTF-8, a Western encoding, or anything other than 'latin-1'. Any suggestions?

Full Code:

from fpdf import FPDF
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
msg = outlook.OpenSharedItem(r"C:\User\language\python\Msg-To-PDF\test_msg.msg")

print (msg.SenderName)
print (msg.SenderEmailAddress)
print (msg.SentOn)
print (msg.To)
print (msg.CC)
print (msg.BCC)
print (msg.Subject)
print (msg.Body)

SenderName = msg.SenderName
SenderEmailAddress = msg.SenderEmailAddress
SentOn = msg.SentOn
To = msg.To
CC = msg.CC
BCC = msg.BCC
Subject = msg.Subject
Body = msg.Body
BodyC = Body.encode('utf-8')

pdf = FPDF()
pdf.add_page()

# pdf.add_font('DejaVu', '', 'DejaVuSansCondensed.ttf', uni=True)
pdf.set_font("Helvetica", style = '', size = 11)
pdf.cell(200, 10, txt="From: " + SenderName, ln=1, align="C")
# pdf.cell(200, 10, border=SentOn, ln=1, align="C")
pdf.cell(200, 10, txt="To: " + To, ln=1, align="C")
pdf.cell(200, 10, txt="CC: " + CC, ln=1, align="C")
pdf.cell(200, 10, txt="BCC: " + BCC, ln=1, align="C")
pdf.cell(200, 10, txt="Subject: " + Subject, ln=1, align="C")
pdf.cell(200, 10, txt="Bod: " + BodyC,  ln=4, align="C")

pdf.output("Sample.pdf")
  • How can I change the encoding away from 'latin-1'?

  • Is there any way to fix these issues globally?

Recommended answer

A workaround is to convert all text to latin-1 encoding before passing it on to the library. You can do that with the following command:

text2 = text.encode('latin-1', 'replace').decode('latin-1')

text2 will be free of any non-latin-1 characters. However, every character that cannot be represented in latin-1 is replaced with a question mark (?).
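If the question marks are too lossy, one possible refinement is to map common typographic characters to ASCII look-alikes first and only then fall back to the replace step. The following is a minimal sketch using only the standard library; the helper name to_latin1_safe and the REPLACEMENTS table are hypothetical, not part of FPDF, and the table is deliberately short.

```python
# Hypothetical helper, not part of FPDF: make text safe for latin-1 core fonts.
REPLACEMENTS = {
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
}


def to_latin1_safe(text: str) -> str:
    """Swap known typographic characters for ASCII, then replace the rest with '?'."""
    table = str.maketrans(REPLACEMENTS)
    text = text.translate(table)
    # Anything still outside latin-1 becomes '?', as in the one-liner above.
    return text.encode("latin-1", "replace").decode("latin-1")


if __name__ == "__main__":
    print(to_latin1_safe("Mon\u2013Fri, \u201copen\u201d"))
```

The result is still a str, so it can be concatenated with other strings and passed to pdf.cell() without the bytes/str TypeError from the question.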

Other recommended answer

The reason for this error is that you are trying to render a character in your PDF that is outside the code range of the latin-1 encoding. FPDF uses latin-1 as the default encoding for all its built-in fonts.
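To see which characters in a string trigger the error, a small check like the one below may help. It is a sketch using only the standard library; the function name non_latin1_chars is hypothetical. It relies on the fact that latin-1 covers exactly the code points U+0000 to U+00FF.

```python
def non_latin1_chars(text: str) -> list:
    """Return the distinct characters in text that latin-1 cannot encode."""
    # latin-1 maps code points 0..255 one-to-one; everything above fails.
    return sorted({ch for ch in text if ord(ch) > 0xFF})


if __name__ == "__main__":
    sample = "Mon\u2013Fri caf\u00e9"
    for ch in non_latin1_chars(sample):
        print(f"U+{ord(ch):04X} cannot be encoded as latin-1")
    try:
        sample.encode("latin-1")
    except UnicodeEncodeError as exc:
        # Same error class as the one raised while writing the PDF.
        print(type(exc).__name__)
```

Note that the accented é (U+00E9) is inside latin-1 and passes, while the en dash (U+2013) does not.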

As a workaround, you can simply remove all characters that do not fit into the latin-1 encoding from your text (see my other answer for this workaround).

To fix this error and be able to render those characters in your PDF, you need to use fonts that support a wider range of characters. To address this, the FPDF library supports Unicode fonts.

For example, you could get the free Google Noto fonts, which support a wide range of Unicode code points. For most Western languages I would recommend the NotoSans font set, but you can also get fonts for many other languages and scripts, including Chinese, Hebrew or Arabic.

Here is how to enable the Unicode fonts in your code for FPDF:

First you need to tell the FPDF library where it can find the font files. In this example I am setting it to the sub-folder fonts of the current folder.

import os
import fpdf
fpdf.set_global("SYSTEM_TTFONTS", os.path.join(os.path.dirname(__file__),'fonts'))

Then you need to add the fonts to your PDF document. In this example I am adding the NotoSans fonts for the styles normal, bold, italic and bold-italic:

pdf = fpdf.FPDF()
pdf.add_font("NotoSans", style="", fname="NotoSans-Regular.ttf", uni=True)
pdf.add_font("NotoSans", style="B", fname="NotoSans-Bold.ttf", uni=True)
pdf.add_font("NotoSans", style="I", fname="NotoSans-Italic.ttf", uni=True)
pdf.add_font("NotoSans", style="BI", fname="NotoSans-BoldItalic.ttf", uni=True)

Now you can use the new fonts normally in your PDF document with set_font(). Here is an example for normal text:

pdf.set_font("NotoSans", size=12)

Other recommended answer

You can also change the encoding through the .set_doc_option() method (documentation here). I tried Erik's method, which worked for me, but then after adding some more complexities (such as a second PDF and using the write_html() method which required creating a new class), I went back to having the same error. Changing the encoding for the whole document should solve the overall problem as you said.

The readthedocs page says you can only use latin-1 or windows-1252, but pdf.set_doc_option('core_fonts_encoding', 'utf-8') worked for me according to the debugger. Just be aware that some characters will need fixing, like the apostrophe (') showing as â€™ in the PDF.
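The garbling described above is consistent with UTF-8 bytes being interpreted as Windows-1252. The sketch below reproduces that effect with the standard library only. It assumes the affected character is the typographic apostrophe U+2019, since the plain ASCII apostrophe has the same single byte in all of these encodings and would not be garbled.

```python
# Assumption: the apostrophe in the message body is U+2019, not ASCII 0x27.
apostrophe = "\u2019"

# UTF-8 stores U+2019 as three bytes.
utf8_bytes = apostrophe.encode("utf-8")

# Reading those three bytes one at a time as Windows-1252 yields three characters.
garbled = utf8_bytes.decode("cp1252")

if __name__ == "__main__":
    print(utf8_bytes.hex(" "))
    print(garbled)
```

This suggests the 'utf-8' setting writes multi-byte sequences that the core fonts still display byte by byte, so it may avoid the exception without actually rendering the characters correctly.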

Hope this is the global fix for this issue you were looking for, even if several months late!