C# iText7 text coordinate extraction question

Question

I am working on a PDF text extractor with iText 7 and am noticing strange text coordinates in one particular PDF. Most documents yield x and y coordinates that fall within the page's width and height, but this one yields negative values. Is there a standard approach to dealing with negative coordinates? My basic approach is to take positive inch measurements on the PDF page and map them to the text and coordinates extracted by iText 7, using a scale of 1/72 inch per point.

I am deriving from LocationTextExtractionStrategy, and the code is as follows:

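        // Namespaces needed at the top of the file:
        // using System.Collections.Generic;
        // using iText.Kernel.Pdf.Canvas.Parser;          // EventType
        // using iText.Kernel.Pdf.Canvas.Parser.Data;     // IEventData, TextRenderInfo
        // using iText.Kernel.Pdf.Canvas.Parser.Listener; // LocationTextExtractionStrategy

        private class LocationTextListStrategy : LocationTextExtractionStrategy
        {
            // TextRect is my own type (definition not shown): a piece of text plus its bounds.
            private readonly List<TextRect> _textRects = new List<TextRect>();

            public List<TextRect> TextRects() => _textRects;

            public override void EventOccurred(IEventData data, EventType type)
            {
                // Only text rendering events carry glyph positions.
                if (type != EventType.RENDER_TEXT)
                    return;

                var renderInfo = (TextRenderInfo)data;
                // Split the chunk so that each character gets its own rectangle.
                var text = renderInfo.GetCharacterRenderInfos();

                foreach (var t in text)
                {
                    if (string.IsNullOrWhiteSpace(t.GetText()))
                        continue;

                    AddTextRect(t);
                }
            }

            private void AddTextRect(TextRenderInfo t)
            {
                // Lower-left corner: start of the baseline.
                var letterStart = t.GetBaseline().GetStartPoint();
                // Upper-right corner: end of the ascent line.
                var letterEnd = t.GetAscentLine().GetEndPoint();

                // Get(0) is the x coordinate, Get(1) the y coordinate.
                var newTextRect = new TextRect(
                    text: t.GetText(),
                    l: letterStart.Get(0),
                    r: letterEnd.Get(0),
                    t: letterEnd.Get(1),
                    b: letterStart.Get(1));

                _textRects.Add(newTextRect);
            }
        }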

Recommended answer

Each PDF page can have its own custom coordinate system. It is common for the origin to be in the lower-left corner of the page, but this is not required.

| Key | Type | Value |
| --- | --- | --- |
| MediaBox | rectangle | (Required; inheritable) A rectangle (see 7.9.5, "Rectangles"), expressed in default user space units, that shall define the boundaries of the physical medium on which the page shall be displayed or printed (see 14.11.2, "Page boundaries"). |
| CropBox | rectangle | (Optional; inheritable) A rectangle, expressed in default user space units, that shall define the visible region of default user space. When the page is displayed or printed, its contents shall be clipped (cropped) to this rectangle (see 14.11.2, "Page boundaries"). Default value: the value of MediaBox. |

(ISO 32000-2:2017, Table 31 — Entries in a page)

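Thus, always interpret coordinates with respect to the crop box of the page they refer to. Negative coordinates are perfectly valid whenever that box does not have its lower-left corner at (0, 0).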
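
To make that concrete, here is a minimal sketch of the arithmetic in plain C#. The `Box` struct and the `PageCoordinates.ToInchesFromCropBox` helper are hypothetical names invented for this illustration and are not part of iText. The sketch also assumes the default user space unit of 1/72 inch (no `UserUnit` entry) and an unrotated page.

```csharp
using System;

// Hypothetical helper type for illustration; it is not part of iText.
public readonly struct Box
{
    public Box(float left, float bottom, float width, float height)
    {
        Left = left;
        Bottom = bottom;
        Width = width;
        Height = height;
    }

    public float Left { get; }
    public float Bottom { get; }
    public float Width { get; }
    public float Height { get; }
}

public static class PageCoordinates
{
    // Assumption: the default user space unit of 1/72 inch applies.
    private const float PointsPerInch = 72f;

    // Shifts a point given in default user space so that it is measured from
    // the lower-left corner of the crop box, then converts points to inches.
    public static (float X, float Y) ToInchesFromCropBox(float x, float y, Box cropBox)
    {
        float xInches = (x - cropBox.Left) / PointsPerInch;
        float yInches = (y - cropBox.Bottom) / PointsPerInch;
        return (xInches, yInches);
    }
}

public static class Demo
{
    public static void Main()
    {
        // Made-up example: a letter-sized crop box centered on the origin,
        // so a large part of the page has negative coordinates.
        var cropBox = new Box(-306f, -396f, 612f, 792f);

        // A glyph at (-234, 324) is 1 inch from the left edge of the crop box
        // and 10 inches from its bottom edge.
        var (x, y) = PageCoordinates.ToInchesFromCropBox(-234f, 324f, cropBox);
        Console.WriteLine($"{x} in from left, {y} in from bottom");
    }
}
```

Subtracting the crop box origin first is what turns the seemingly negative coordinates into positive offsets that can be compared with inch measurements taken on the visible page.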

The iText 7 class PdfPage has matching getters: GetMediaBox() and GetCropBox().
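
As a rough sketch of how those getters might be used, the following program prints the media box and crop box of every page so you can see which pages have a shifted origin. It assumes the iText 7 package is referenced; the class name `PageBoxDump` and the fallback path `input.pdf` are placeholders made up for this example.

```csharp
using System;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;

public static class PageBoxDump
{
    public static void Main(string[] args)
    {
        // "input.pdf" is a placeholder; pass the real path as the first argument.
        string path = args.Length > 0 ? args[0] : "input.pdf";

        var pdf = new PdfDocument(new PdfReader(path));
        try
        {
            // iText page numbers start at 1.
            for (int i = 1; i <= pdf.GetNumberOfPages(); i++)
            {
                PdfPage page = pdf.GetPage(i);
                Rectangle media = page.GetMediaBox();
                Rectangle crop = page.GetCropBox();

                Console.WriteLine(
                    $"Page {i}: MediaBox origin ({media.GetLeft()}, {media.GetBottom()}), " +
                    $"CropBox origin ({crop.GetLeft()}, {crop.GetBottom()}), " +
                    $"CropBox size {crop.GetWidth()} x {crop.GetHeight()}");
            }
        }
        finally
        {
            pdf.Close();
        }
    }
}
```

If a page reports a crop box origin other than (0, 0), subtract `crop.GetLeft()` and `crop.GetBottom()` from the extracted text coordinates before converting them to inches.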
