Problem description
I am working on a PDF text extractor with iText 7 and am noticing strange text coordinates on one particular PDF. Most documents yield x and y coordinates within the width and height of the page, but this one yields negative values. Is there a standard approach to dealing with negative coordinates here? My basic approach is to take positive inch measurements on the page and map them to the text and coordinates extracted by iText 7, using a scale of 1/72 inch per point.
I am deriving from LocationTextExtractionStrategy, and the code is as follows:
    private class LocationTextListStrategy : LocationTextExtractionStrategy
    {
        private readonly List<TextRect> _textRects = new List<TextRect>();

        public List<TextRect> TextRects() => _textRects;

        public override void EventOccurred(IEventData data, EventType type)
        {
            if (!type.Equals(EventType.RENDER_TEXT)) return;

            var renderInfo = (TextRenderInfo)data;
            var text = renderInfo.GetCharacterRenderInfos();
            foreach (var t in text)
            {
                if (string.IsNullOrWhiteSpace(t.GetText())) continue;
                AddTextRect(t);
            }
        }

        private void AddTextRect(TextRenderInfo t)
        {
            var letterStart = t.GetBaseline().GetStartPoint();
            var letterEnd = t.GetAscentLine().GetEndPoint();
            var newTextRect = new TextRect(
                text: t.GetText(),
                l: letterStart.Get(0),
                r: letterEnd.Get(0),
                t: letterEnd.Get(1),
                b: letterStart.Get(1));
            _textRects.Add(newTextRect);
        }
    }
Recommended answer
Each PDF page can have its own, custom coordinate system. It is common to have the origin in the lower left corner of the page but it is not required.
Key | Type | Value |
---|---|---|
MediaBox | rectangle | (Required; inheritable) A rectangle (see 7.9.5, "Rectangles"), expressed in default user space units, that shall define the boundaries of the physical medium on which the page shall be displayed or printed (see 14.11.2, "Page boundaries"). |
CropBox | rectangle | (Optional; Inheritable) A rectangle, expressed in default user space units, that shall define the visible region of default user space. When the page is displayed or printed, its contents shall be clipped (cropped) to this rectangle (see 14.11.2, "Page boundaries"). Default value: the value of MediaBox. |
(ISO 32000-2:2017, Table 31 — Entries in a page)
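Thus, always interpret coordinates with respect to the crop box of the page they refer to. If the lower left corner of that box is not at (0, 0), coordinates on the page can legitimately be negative.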
The iText 7 class PdfPage has matching getters, GetMediaBox() and GetCropBox().
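
As a rough illustration of interpreting coordinates relative to the crop box, the sketch below shifts a point by the crop box origin and then scales it to inches. The class CropBoxCoordinates and its method names are hypothetical and not part of iText. It assumes the page is not rotated and has no UserUnit entry, so one user space unit is 1/72 inch. In iText 7 the crop box edges would presumably come from page.GetCropBox() via GetLeft(), GetBottom() and GetTop().

```csharp
using System;

// Hypothetical helper, not part of iText: converts a point given in default
// user space into inches measured from a corner of the page's crop box.
public static class CropBoxCoordinates
{
    // Assumption: no UserUnit entry on the page, so 1 unit = 1/72 inch.
    private const float PointsPerInch = 72f;

    // cropLeft and cropBottom are the lower left corner of the crop box.
    // Y grows upward, as in PDF user space.
    public static (float X, float Y) ToInchesFromLowerLeft(
        float x, float y, float cropLeft, float cropBottom)
    {
        return ((x - cropLeft) / PointsPerInch,
                (y - cropBottom) / PointsPerInch);
    }

    // Same point, but with Y measured downward from the top edge of the
    // crop box, which is often handier for "inches from the top" layouts.
    public static (float X, float Y) ToInchesFromUpperLeft(
        float x, float y, float cropLeft, float cropTop)
    {
        return ((x - cropLeft) / PointsPerInch,
                (cropTop - y) / PointsPerInch);
    }
}
```

For example, on a letter-sized page whose crop box is [-306 -396 306 396], a glyph at (-234, -324) would come out at 1 inch from the left edge and 1 inch from the bottom edge, even though both raw coordinates are negative. Page rotation is deliberately left out here and would need separate handling.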