Regex between and including characters across multiple lines

Question

I have the below text:

BEGIN:
>>DocTypeName: Zoning Letter
>>DocDate: 4/16/2014
Loan Number: 355211
Ad Hoc: ZONING VERIFICATION LETTER
Document Handle: 712826
>>DiskgroupNum: 102
>>VolumeNum: 367
>>NumOfPages: 0
>>FileSize: 261711
>>DocRevNum: 0
>>Rendition: 1
>>PhysicalPageNum: 0
>>ItemPageNum: 0
>>FileTypeNum: 16
>>ImageType: 0
>>Compress: 2
>>Xdpi: 0
>>Ydpi: 0
>>FileName: \V367\2855\1558564.PDF
BEGIN:
>>DocTypeName: Zoning Letter
>>DocDate: 4/16/2014
Loan Number: 355211
Ad Hoc: ZONING CODES COMPLIANCE LETTER
Document Handle: 712825
>>DiskgroupNum: 102
>>VolumeNum: 367
>>NumOfPages: 0
>>FileSize: 19441
>>DocRevNum: 0
>>Rendition: 1
>>PhysicalPageNum: 0
>>ItemPageNum: 0
>>FileTypeNum: 16
>>ImageType: 0
>>Compress: 2
>>Xdpi: 0
>>Ydpi: 0
>>FileName: \V367\2855\1558563.pdf

I need to use regex (which will go in a C# program) to convert this into something suitable for a CSV. The most vital data is the document handle and the filename (path) from each section (a section being everything under a "BEGIN:"). I'm working on this for someone else, so I'd like to retain as much as possible in case they decide they need some of the other data. This was my initial attempt:

\r\n(?!BEGIN).*\:

However, not every section has an "Ad Hoc:" component, which throws off the cell alignment when pulled into Excel. Ad Hoc I know for sure is not part of the data that is needed for the end result.
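To make the alignment problem concrete, here is a minimal sketch that applies the initial attempt to an abbreviated version of the data. The shortened sample, the missing "Ad Hoc:" line in the second record, and the choice of ";" as replacement are assumptions for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

// Abbreviated sample: the first record has an "Ad Hoc:" line, the second does not.
string text =
    "BEGIN:\r\n" +
    "Loan Number: 355211\r\n" +
    "Ad Hoc: ZONING VERIFICATION LETTER\r\n" +
    "Document Handle: 712826\r\n" +
    "BEGIN:\r\n" +
    "Loan Number: 355211\r\n" +
    "Document Handle: 712825";

// The initial attempt: a line break not followed by BEGIN, then the label up to ':'.
string result = Regex.Replace(text, @"\r\n(?!BEGIN).*\:", ";");

Console.WriteLine(result);
// BEGIN:; 355211; ZONING VERIFICATION LETTER; 712826
// BEGIN:; 355211; 712825
```

The first row ends up with three fields and the second with two, so the document handle lands in a different column for each record.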

The best case scenario would be to just select and remove everything between every "Ad Hoc" and "Handle:" to be replaced with the delimiter (;). I would then pipe this along with my above regex.

My only other requirement is that this has to all be in one regex statement - otherwise in the program I've written I'll have to set up some sort of loop or while business which I'm not prepared to do yet.

Recommended answer

Based on what I understood from the comments underneath the question, the example data given in the question should be transformed into two text lines like this:

Zoning Letter;4/16/2014;355211;712826;102;367;0;261711;0;1;0;0;16;0;2;0;0;\V367\2855\1558564.PDF
Zoning Letter;4/16/2014;355211;712825;102;367;0;19441;0;1;0;0;16;0;2;0;0;\V367\2855\1558563.pdf

To achieve this result while avoiding a loop (although I wonder why you would want to avoid loops - they are basic and omnipresent constructs), I would suggest applying two (or three, see section 3 below) regex substitutions.


1. Removal of "Label:" and replacement of line breaks with ";"

The first regular expression matches each label up to and including its ":", together with any preceding line break, so that the whole match can be replaced with a semicolon. However, it will not match a line break in front of "BEGIN:", and neither will it touch the "BEGIN:" itself.

@"(([\r\n]+\s*Ad\sHoc:.*?[\r\n]+)|([\r\n]+(?!\s*BEGIN))).*?:\s*"

This regex is an OR-combination of two regexes:

[\r\n]+\s*Ad\sHoc:.*?[\r\n]+.*?:\s*

which will match "Ad Hoc:" lines including any "Label:" string in the following line, and

([\r\n]+(?!\s*BEGIN)).*?:\s*

which will match any "Label:" including the line break in front of it, except for the "BEGIN:" label.

Applying this regex to your example and replacing all matches with ";" will result in the following:

BEGIN:;Zoning Letter;4/16/2014;355211;712826;102;367;0;261711;0;1;0;0;16;0;2;0;0;\V367\2855\1558564.PDF
BEGIN:;Zoning Letter;4/16/2014;355211;712825;102;367;0;19441;0;1;0;0;16;0;2;0;0;\V367\2855\1558563.pdf

Note the "BEGIN:;" which we will take care of now.
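As a rough illustration of how the two alternatives divide the work, the sketch below runs the step 1 regex over an abbreviated sample and reports which alternative produced each match. The shortened data and the variable names (`adHocMatches`, `step1`) are assumptions; the pattern is the one given above.

```csharp
using System;
using System.Text.RegularExpressions;

// Abbreviated sample (assumption: CRLF line endings, as in the question's data).
string text =
    "BEGIN:\r\n" +
    ">>DocTypeName: Zoning Letter\r\n" +
    "Loan Number: 355211\r\n" +
    "Ad Hoc: ZONING VERIFICATION LETTER\r\n" +
    "Document Handle: 712826\r\n" +
    "BEGIN:\r\n" +
    ">>DocTypeName: Zoning Letter\r\n" +
    "Loan Number: 355211\r\n" +
    "Document Handle: 712825";

var reSemicolons = new Regex(
    @"(([\r\n]+\s*Ad\sHoc:.*?[\r\n]+)|([\r\n]+(?!\s*BEGIN))).*?:\s*");

int adHocMatches = 0;
foreach (Match m in reSemicolons.Matches(text))
{
    // Group 2 is the "Ad Hoc:" alternative, group 3 the plain "Label:" one.
    bool isAdHoc = m.Groups[2].Success;
    if (isAdHoc) adHocMatches++;

    // Show line breaks as visible escapes so each match prints on one line.
    string shown = m.Value.Replace("\r", "\\r").Replace("\n", "\\n");
    Console.WriteLine((isAdHoc ? "ad hoc: " : "label:  ") + shown);
}

// Replacing every match with ";" yields one line per record.
string step1 = reSemicolons.Replace(text, ";");
Console.WriteLine(step1);
```

Only the first record triggers the "Ad Hoc:" alternative, and that single match swallows both the "Ad Hoc:" line and the "Document Handle:" label after it, which is why both records end up with the same number of fields.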


2. Elimination of the "BEGIN:" labels

This is a rather simple pattern when looking at the result of the first regex substitution.

"(?m)^BEGIN:;"

You might think that you can do this through a string replacement - and so did I when writing the first version of my answer. However, a mere string replacement becomes a problem when "BEGIN:;" can be part of the content of any other text field. Better to be correct and safe by specifying a regex which matches only at the beginning of a line.
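The difference can be sketched with a made-up step 1 result in which a field value happens to end in "BEGIN:". The input string is hypothetical and does not come from the question's data.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical step 1 output: the second field's value ends in "BEGIN:".
string step1 = "BEGIN:;Zoning Letter;see BEGIN:;712826";

// Plain string replacement removes every occurrence, including the one inside the data.
string viaString = step1.Replace("BEGIN:;", "");

// The anchored regex only removes the occurrence at the start of a line.
string viaRegex = Regex.Replace(step1, "(?m)^BEGIN:;", "");

Console.WriteLine(viaString); // Zoning Letter;see 712826
Console.WriteLine(viaRegex);  // Zoning Letter;see BEGIN:;712826
```

The string replacement silently drops a delimiter and merges two fields, while the anchored version leaves the field content alone.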


3. Code example, including elimination of empty lines in the source text

If the source text contains empty lines, or lines consisting only of whitespace, the regular expressions shown above might not work properly. The solution is to do another regex substitution beforehand, which reduces such lines to a single line break (if you are certain that your source data does not contain empty lines, you can omit this step).
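A small sketch of what this preliminary substitution does, using the pattern from the code example in this section. The sample input with blank and whitespace-only lines is an assumption made up for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up input with whitespace-only and empty lines between the data lines.
string text =
    "BEGIN:\r\n" +
    "   \r\n" +
    "\r\n" +
    ">>DocDate: 4/16/2014\r\n" +
    " \t\r\n" +
    "Loan Number: 355211";

// A run of whitespace that ends in a line-break character becomes one "\r\n".
string cleaned = Regex.Replace(text, @"[\s\r\n]+[\r\n]", "\r\n");

Console.WriteLine(cleaned);
// BEGIN:
// >>DocDate: 4/16/2014
// Loan Number: 355211
```

Spaces inside a line (such as the one in "Loan Number") are untouched, because the match must end in a line-break character.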

A complete code example, which would produce the result as mentioned at the beginning of my answer, could look like this:

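// requires: using System.Text.RegularExpressions;
string sourceData = "..."; // your text with the source data

// Collapse empty or whitespace-only lines into a single line break.
Regex reEmptyLines = new Regex(@"[\s\r\n]+[\r\n]", RegexOptions.Compiled);
// Replace each "Label:" (and any "Ad Hoc:" line) with ";".
Regex reSemicolons = new Regex(@"(([\r\n]+\s*Ad\sHoc:.*?[\r\n]+)|([\r\n]+(?!\s*BEGIN))).*?:\s*", RegexOptions.Compiled);
// Strip the "BEGIN:;" left at the start of each output line.
Regex reBegin = new Regex("(?m)^BEGIN:;", RegexOptions.Compiled);

// The substitutions run from the innermost call outwards.
string processed =
    reBegin.Replace(
        reSemicolons.Replace(
            reEmptyLines.Replace(sourceData, "\r\n"),
            ";"
        ),
        string.Empty
    );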

Other recommended answer

You can use a regex, but I wouldn't say it is easier than doing it manually in a loop.

(?<=BEGIN:\r\n)(?:.*:\s*(?:(?<value>(?<!Ad Hoc:\s*).*)|.*)(?:\r\n)?)*?(?=BEGIN:|$)

Sample code:

foreach (Match m in Regex.Matches(text, @"(?<=BEGIN:\r\n)(?:.*:\s*(?:(?<value>(?<!Ad Hoc:\s*).*)|.*)(?:\r\n)?)*?(?=BEGIN:|$)"))
{
    Console.WriteLine(string.Join(",", m.Groups["value"].Captures.Cast<Capture>().Select(c => c.Value)));
}

Output:

Zoning Letter,4/16/2014,355211,712826,102,367,0,261711,0,1,0,0,16,0,2,0,0,\V367\2855\1558564.PDF
Zoning Letter,4/16/2014,355211,712825,102,367,0,19441,0,1,0,0,16,0,2,0,0,\V367\2855\1558563.pdf
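For comparison, the manual loop mentioned at the start of this answer could look roughly like the sketch below. It is not part of the original answer: the abbreviated sample, the ";" delimiter, and the `records` variable are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Abbreviated sample; the second record has no "Ad Hoc:" line.
string text =
    "BEGIN:\r\n" +
    "Ad Hoc: ZONING VERIFICATION LETTER\r\n" +
    "Document Handle: 712826\r\n" +
    ">>FileName: \\V367\\2855\\1558564.PDF\r\n" +
    "BEGIN:\r\n" +
    "Document Handle: 712825\r\n" +
    ">>FileName: \\V367\\2855\\1558563.pdf";

var records = new List<List<string>>();

foreach (string rawLine in text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None))
{
    string line = rawLine.Trim();
    if (line.Length == 0) continue;            // skip empty lines

    if (line == "BEGIN:")
    {
        records.Add(new List<string>());       // start a new record
        continue;
    }

    int colon = line.IndexOf(':');
    if (records.Count == 0 || colon < 0) continue;          // not a "Label: value" line
    if (line.Substring(0, colon) == "Ad Hoc") continue;     // drop the unwanted field

    // Everything after the first colon is the value.
    records[records.Count - 1].Add(line.Substring(colon + 1).Trim());
}

var rows = new List<string>();
foreach (List<string> record in records)
{
    rows.Add(string.Join(";", record));
}

foreach (string row in rows) Console.WriteLine(row);
// 712826;\V367\2855\1558564.PDF
// 712825;\V367\2855\1558563.pdf
```

Because the value is taken from the first colon onwards, a value that itself contains a colon would survive intact in this version.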

Other recommended answer

How's this:

BEGIN:((?:(?!BEGIN:).)*)

This would match everything between one BEGIN: and the next, provided the regex is run with RegexOptions.Singleline so that "." also matches line breaks.
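A minimal sketch of using this pattern from C#, on an abbreviated sample. The shortened data and the `sections` list are assumptions; the point is that in .NET "." does not match "\n" unless `RegexOptions.Singleline` is set.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Abbreviated sample with two sections.
string text =
    "BEGIN:\r\n" +
    "Loan Number: 355211\r\n" +
    "Document Handle: 712826\r\n" +
    "BEGIN:\r\n" +
    "Loan Number: 355211\r\n" +
    "Document Handle: 712825";

string pattern = @"BEGIN:((?:(?!BEGIN:).)*)";

// Without Singleline, "." stops at "\n", so group 1 holds no section content.
string withoutOption = Regex.Match(text, pattern).Groups[1].Value.Trim();

// With Singleline, group 1 spans every line up to the next "BEGIN:".
var sections = new List<string>();
foreach (Match m in Regex.Matches(text, pattern, RegexOptions.Singleline))
{
    sections.Add(m.Groups[1].Value.Trim());
}

Console.WriteLine(withoutOption.Length); // 0
Console.WriteLine(sections.Count);       // 2
Console.WriteLine(sections[0]);
```

Each captured section still contains the labels and the "Ad Hoc:" line, so it would need further processing before it is usable as a CSV row.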