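Reading a PDF line by line - iTextSharp
I'm not sure what is wrong with my code. It reads a PDF file and grabs all of the text, but every item is merged into a single string without any delimiter.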
Sample:
"House: 2
Bedroom: 3
Bathsroom 4"
would be read as "House: 2Bedroom: 3Bathsroom 4"
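I have searched through all the examples I could find, to no avail. I also tried LocationTextExtractionStrategy, which did not help, and using the .Split method did not help either.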
' Imports needed at the top of the file:
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Public Shared Function ParseAllPdfText(ByVal filepath As String) As String
    Dim sbtxt, currenttext As String
    sbtxt = ""
    Try
        Using reader As New PdfReader(filepath)
            ' Extract the text of each page in turn (page numbers are 1-based)
            For intPages As Integer = 1 To reader.NumberOfPages
                currenttext = PdfTextExtractor.GetTextFromPage(reader, intPages, New LocationTextExtractionStrategy())
                ' Round-trip the text through the system default code page into UTF-8
                currenttext = Encoding.UTF8.GetString(Encoding.Convert(Encoding.[Default], Encoding.UTF8, Encoding.[Default].GetBytes(currenttext)))
                ' Append the page text followed by a line break
                sbtxt = sbtxt & currenttext & vbCrLf
            Next
        End Using
    Catch ex As Exception
        MsgBox("There was an error extracting text from the file: " & ex.Message, vbInformation, "Error Extracting Text")
    End Try
    Return sbtxt
End Function
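For reference, here is a minimal sketch of the kind of .Split call mentioned above. It assumes the extracted text actually contains line-break characters (LF, CRLF, or CR) between the items; if the extractor emits none for this particular PDF, splitting cannot recover the lines. The module name LineSplitSketch and the helper SplitIntoLines are hypothetical and only serve the illustration, and the sample string stands in for real extractor output.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module LineSplitSketch
    ' Hypothetical helper: splits extracted page text into individual lines.
    ' Assumption: lines are separated by CRLF, LF, or CR.
    Public Function SplitIntoLines(ByVal pageText As String) As String()
        Dim separators As String() = {vbCrLf, vbLf, vbCr}
        ' RemoveEmptyEntries drops blank lines, including any left by mixed separators
        Return pageText.Split(separators, StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        ' Stand-in for extractor output; real text would come from GetTextFromPage
        Dim sample As String = "House: 2" & vbLf & "Bedroom: 3" & vbLf & "Bathsroom 4"
        For Each textLine As String In SplitIntoLines(sample)
            Console.WriteLine(textLine)
        Next
    End Sub
End Module
```

Splitting on vbCrLf alone would miss text that uses a bare line feed, which is why the sketch lists all three separators.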
Can you share the problematic PDF? Also, what are you trying to achieve with the 'Encoding' juggling line? – mkl