Remove non-ASCII characters from a string using Python / Django

Question

I have a string of HTML stored in a database. Unfortunately, it contains characters such as ®. I want to replace these characters with their HTML equivalents, either in the DB itself or using a find-and-replace in my Python / Django code.

Any suggestions on how I can do this?

Recommended answer

ASCII characters are the first 128 code points (0 through 127), so you can get the code point of each character with ord and strip the character if it is out of range:

# -*- coding: utf-8 -*-

def strip_non_ascii(string):
    ''' Returns the string without non ASCII characters'''
    # ASCII covers code points 0 through 127
    stripped = (c for c in string if ord(c) < 128)
    return ''.join(stripped)


test = u'éáé123456tgreáé@€'
print(test)
print(strip_non_ascii(test))

Result

éáé123456tgreáé@€
123456tgre@

Please note that @ is included because it is, after all, an ASCII character. If you want to keep only a particular subset (for example just digits and uppercase and lowercase letters), you can narrow the range by consulting an ASCII table.
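As a rough sketch of limiting the result to a subset, the variant below keeps only ASCII letters and digits. The function name `keep_alphanumeric` and the `ALLOWED` set are illustrative assumptions, not part of the original answer.

```python
import string

# Hypothetical allow-list: ASCII letters and digits only (chosen for illustration)
ALLOWED = set(string.ascii_letters + string.digits)


def keep_alphanumeric(text):
    """Return text with every character outside ALLOWED removed."""
    return ''.join(c for c in text if c in ALLOWED)


print(keep_alphanumeric(u'éáé123456tgreáé@€'))  # prints 123456tgre
```

An explicit allow-list tends to be easier to read and adjust than a pair of numeric bounds, since letters, digits and punctuation are interleaved in the ASCII table.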

EDITED: After reading your question again, maybe you need to escape your HTML code so that all those characters appear correctly once rendered. You can use the escape filter in your templates.

Other recommended answer

There's a much simpler answer to this at https://stackoverflow.com/a/18430817/5100481

To remove non-ASCII characters from a string, s, use:

s = s.encode('ascii',errors='ignore')

Then convert it from bytes back to a string using:

s = s.decode()

All of this was done with Python 3.6.
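If the goal is to keep the characters as HTML character references rather than drop them, as the question asks, the same encode/decode round trip can be sketched with Python's built-in `xmlcharrefreplace` error handler, which replaces each unencodable character with a numeric reference. The helper names below are illustrative, not from the answer.

```python
def drop_non_ascii(s):
    """Drop every character that cannot be encoded as ASCII."""
    return s.encode('ascii', errors='ignore').decode('ascii')


def non_ascii_to_char_refs(s):
    """Replace each non-ASCII character with a numeric reference such as &#174;."""
    return s.encode('ascii', errors='xmlcharrefreplace').decode('ascii')


text = 'such as ® I want'
print(drop_non_ascii(text))          # such as  I want  (note the doubled space)
print(non_ascii_to_char_refs(text))  # such as &#174; I want
```

Because only non-ASCII characters trigger the error handler, existing markup such as `<b>` or `&amp;` passes through unchanged.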

Other recommended answer

I found this a while ago, so this isn't in any way my work. I can't find the source, but here's the snippet from my code.

def unicode_escape(unistr):
    """
    Tidies up unicode characters into HTML-friendly numeric entities

    Takes a unicode string as an argument

    Returns a unicode string
    """
    # Python 3 module; in Python 2 it was called htmlentitydefs
    from html.entities import codepoint2name
    escaped = ""

    for char in unistr:
        # Only characters that have a named HTML entity are rewritten;
        # note that this includes &, <, > and the double quote
        if ord(char) in codepoint2name:
            # Numeric character reference, terminated with a semicolon
            escaped += "&#" + str(ord(char)) + ";"
        else:
            escaped += char

    return escaped

Use it like this:

>>> from zack.utilities import unicode_escape
>>> unicode_escape(u'such as ® I want')
'such as &#174; I want'
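Since the question concerns a stored HTML string, a variant that leaves ASCII untouched and converts only non-ASCII characters may be more suitable: the function above also rewrites `<`, `>`, `&` and `"`, because those code points appear in `codepoint2name`. The sketch below assumes Python 3 and its `html.entities` module; the function name is hypothetical.

```python
from html.entities import codepoint2name


def non_ascii_to_named_entities(text):
    """Replace non-ASCII characters with named entities where one exists,
    falling back to numeric character references otherwise."""
    parts = []
    for char in text:
        code = ord(char)
        if code < 128:
            # Leave ASCII (including <, > and &) untouched so markup survives
            parts.append(char)
        elif code in codepoint2name:
            # Named entity, e.g. &reg; for ®
            parts.append('&' + codepoint2name[code] + ';')
        else:
            # No named entity available: use a numeric reference instead
            parts.append('&#' + str(code) + ';')
    return ''.join(parts)


print(non_ascii_to_named_entities('<b>Acme® café</b>'))
# <b>Acme&reg; caf&eacute;</b>
```

Named entities are easier to read in the stored HTML, while the numeric fallback covers characters that have no name in the HTML entity table.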