Question
I get user input including non-ASCII characters and non-printable characters, such as
\xc2d \xa0 \xe7 \xc3\ufffdd \xc3\ufffdd \xc2\xa0 \xc3\xa7 \xa0\xa0
For example:
email : abc@gmail.com\xa0\xa0 street : 123 Main St.\xc2\xa0
Desired output:
email : abc@gmail.com street : 123 Main St.
What is the best way to remove them using Java?
I tried the following, but it doesn't seem to work:
public static void main(String args[]) throws UnsupportedEncodingException {
    String s = "abc@gmail\\xe9.com";
    String email = "abc@gmail.com\\xa0\\xa0";
    System.out.println(s.replaceAll("\\P{Print}", ""));
    System.out.println(email.replaceAll("\\P{Print}", ""));
}
Output:
abc@gmail\xe9.com
abc@gmail.com\xa0\xa0
Recommended answer
Your requirements are not clear. All characters in a Java String are Unicode characters, so if you remove them, you'll be left with an empty string. I assume what you mean is that you want to remove any non-ASCII, non-printable characters.
String clean = str.replaceAll("\\P{Print}", "");
Here, \p{Print} represents a POSIX character class for printable ASCII characters, while \P{Print} is the complement of that class. With this expression, all characters that are not printable ASCII are replaced with the empty string. (The extra backslash is because \ starts an escape sequence in string literals.)
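As a minimal sketch of that expression at work, the example below applies it to strings that really do contain the characters U+00A0 (no-break space) and U+00E9 (é), written here with Java \u escapes. The class and method names are illustrative and not part of the original answer.

```java
class PrintableAsciiDemo {
    // Removes every character that is not printable ASCII (0x20 through 0x7E).
    static String stripNonPrintableAscii(String input) {
        return input.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        // The literal below ends with two real no-break space characters (U+00A0).
        String email = "abc@gmail.com\u00a0\u00a0";
        // The literal below contains a real e-acute character (U+00E9).
        String accented = "abc@gmail\u00e9.com";

        System.out.println("[" + stripNonPrintableAscii(email) + "]");    // prints [abc@gmail.com]
        System.out.println("[" + stripNonPrintableAscii(accented) + "]"); // prints [abc@gmail.com]
    }
}
```

Note that the accented character is dropped rather than replaced with a plain "e", since the expression only deletes.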
Apparently, all the input characters are actually ASCII characters that represent a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.
This all sounds a little fishy to me. What I believe is happening is that the data really do contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing these with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.
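The difference between the two kinds of string can be sketched as follows. This is an illustration of the guess above, not a diagnosis of the asker's actual data; the class and variable names are made up for the example.

```java
class EscapeVersusCharacterDemo {
    static String stripNonPrintableAscii(String input) {
        return input.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        // As written in the question: a backslash followed by 'x', 'e', '9'.
        // All four are printable ASCII characters, so nothing is removed.
        String literalEscape = "abc@gmail\\xe9.com";

        // What the data presumably contains: the single character U+00E9.
        String realCharacter = "abc@gmail\u00e9.com";

        System.out.println(literalEscape.length()); // prints 17
        System.out.println(realCharacter.length()); // prints 14

        System.out.println(stripNonPrintableAscii(literalEscape)); // unchanged
        System.out.println(stripNonPrintableAscii(realCharacter)); // prints abc@gmail.com
    }
}
```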
That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.
String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");
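A short sketch of that second expression, run against the literal-escape strings from the question. The class and method names are hypothetical. It only handles the \xHH form; other escape styles in the sample data, such as \ufffd, would need their own pattern.

```java
class LiteralEscapeStripper {
    // Removes the literal text "\x" followed by exactly two hexadecimal digits.
    static String stripHexEscapes(String input) {
        return input.replaceAll("\\\\x\\p{XDigit}{2}", "");
    }

    public static void main(String[] args) {
        // Each "\\x" in these literals is a real backslash followed by 'x'.
        String email = "abc@gmail.com\\xa0\\xa0";
        String street = "123 Main St.\\xc2\\xa0";

        System.out.println(stripHexEscapes(email));  // prints abc@gmail.com
        System.out.println(stripHexEscapes(street)); // prints 123 Main St.
    }
}
```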
The API documentation for the Pattern class does a good job of listing all of the syntax supported by Java's regex library. For more elaboration on what all of the syntax means, I have found the Regular-Expressions.info site very helpful.
Another answer
With Google Guava's CharMatcher, you can remove any non-printable characters and then retain all ASCII characters (dropping any accents) like this:
String printable = CharMatcher.INVISIBLE.removeFrom(input);
String clean = CharMatcher.ASCII.retainFrom(printable);
Not sure if that's what you really want, but it removes anything expressed as escape sequences in your question's sample data.
Another answer
This may be late, but for future reference:
String clean = str.replaceAll("\\P{Print}", "");
This removes all non-printable characters, but that includes \n (line feed), \t (tab) and \r (carriage return), and sometimes you want to keep those characters.
For that problem use inverted logic:
String clean = str.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
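To compare the two expressions side by side, here is a small sketch. The class name, method names, and sample input are made up for illustration.

```java
class KeepWhitespaceDemo {
    // Removes everything that is not printable ASCII, including \n, \r and \t.
    static String stripAll(String input) {
        return input.replaceAll("\\P{Print}", "");
    }

    // Same idea, but line feeds, carriage returns and tabs are kept.
    static String stripKeepingWhitespace(String input) {
        return input.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
    }

    public static void main(String[] args) {
        // Contains a BEL control character (U+0007) and a no-break space (U+00A0)
        // in addition to ordinary line breaks and a tab.
        String input = "line one\n\tline two\u0007\u00a0\r\n";

        System.out.println(stripAll(input).equals("line oneline two"));                        // prints true
        System.out.println(stripKeepingWhitespace(input).equals("line one\n\tline two\r\n")); // prints true
    }
}
```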