In this article, I will show you how easily one can extract data from PDF without any knowledge of VBA/macros or any sort of coding. Simple technique is required if you know M query or power query as we know it better. Some of you might already aware of this however this article would help folks who are new to BI world. Many a times, we come across this situation where we receive data in form of PDF, be it either an invoice or data related to an order. The customer asks us to create power BI or Excel dashboard which will be sourced based on these PDFs.
Now the question arises in front of us is how to achieve this
It is quite easy to source the data from PDF as many connecters are readily available in Excel/Power BI that helps you connect your application to PDF. The major work comes after data ingestion. So lets see how to extract data and perform data transformation steps to achieve desired output
Step 1 - Data Ingestion
Connect the PDF using Get data in a normal excel workbook. Remember to have O365 license else this option won't be there and you might have to use advanced editor to connect to a PDF using
let
Source = Pdf. Tables(//put your path and other variables)
in
Source
Step 2 - Data Transformation
Performing data transformation as per the PDF document. This step is bit customized for every document that you are working on. In this example I have a defined set of data for which I would like to extract the data. As soon as you go to advanced editor, you would see the data is categorized into many columns
Create a custom column that combines all columns to give a single column with all texts
The new column would appear as below and would skip all the null columns while doing concatenation
Adding a constant Index of 1 --> In the next step we would need a column on which we have to perform grouping transformation of all rows in a single line that can represent as a sentence. Hence we would need to have a column with a constant value. Here I have taken 1 where as we can write anything like, A or B or C etc.
Using grouping technique but with a slight change of implication - We know when do we use a grouping technique. Basically its an aggregator and usually sum or count based on same group. We shall leverage that idea but would not aggregate but rather combine the rows into a single row value
The code below is the step that we have to perform. Notice the Text. Combine function after each. This is the function that would combine or rather aggregate our multiple rows in to one single row as we have only 1 group i.e. Index 1
#"Grouped Rows" = Table. Group(#"Removed Other Columns",{"Index"},{{"All text", each Text. Combine([text]," "), type text}})
Later we can remove the index column as well and what we get is pure text extracted from PDF. This is a dynamic code and would always extract data in case there is any change in the text. Super helpful as mentioned earlier in extracting information from standard format of invoice, or order document, etc.
Thankyou for your patience. Let me know if this helps you anywhere. Happy learning
thanks for the wonderful tutorial. It just saved my day
nice article. very informative and helpful