top of page
neerajmukesh

How to extract data from PDF to Excel with Power Query (Low code - No Code technique/ No VBA)

In this article, I will show you how easily one can extract data from PDF without any knowledge of VBA/macros or any sort of coding. Simple technique is required if you know M query or power query as we know it better. Some of you might already aware of this however this article would help folks who are new to BI world. Many a times, we come across this situation where we receive data in form of PDF, be it either an invoice or data related to an order. The customer asks us to create power BI or Excel dashboard which will be sourced based on these PDFs.


Now the question arises in front of us is how to achieve this


It is quite easy to source the data from PDF as many connecters are readily available in Excel/Power BI that helps you connect your application to PDF. The major work comes after data ingestion. So lets see how to extract data and perform data transformation steps to achieve desired output



Step 1 - Data Ingestion


Connect the PDF using Get data in a normal excel workbook. Remember to have O365 license else this option won't be there and you might have to use advanced editor to connect to a PDF using


let 
    Source = Pdf. Tables(//put your path and other variables)
in 
    Source


Step 2 - Data Transformation


  • Performing data transformation as per the PDF document. This step is bit customized for every document that you are working on. In this example I have a defined set of data for which I would like to extract the data. As soon as you go to advanced editor, you would see the data is categorized into many columns

pdf to excel using power query
data transformation in power query

  • Create a custom column that combines all columns to give a single column with all texts


pdf to excel using power query
Creating a Custom Column By Concatenation

  • The new column would appear as below and would skip all the null columns while doing concatenation

pdf to excel using power query

  • Adding a constant Index of 1 --> In the next step we would need a column on which we have to perform grouping transformation of all rows in a single line that can represent as a sentence. Hence we would need to have a column with a constant value. Here I have taken 1 where as we can write anything like, A or B or C etc.


pdf to excel using power query
Adding an Index to classify all rows as one group

  • Using grouping technique but with a slight change of implication - We know when do we use a grouping technique. Basically its an aggregator and usually sum or count based on same group. We shall leverage that idea but would not aggregate but rather combine the rows into a single row value


pdf to excel using power query
Grouping multiple rows into one single row

  • The code below is the step that we have to perform. Notice the Text. Combine function after each. This is the function that would combine or rather aggregate our multiple rows in to one single row as we have only 1 group i.e. Index 1


#"Grouped Rows" = Table. Group(#"Removed Other Columns",{"Index"},{{"All text", each Text. Combine([text]," "), type text}})

  • Later we can remove the index column as well and what we get is pure text extracted from PDF. This is a dynamic code and would always extract data in case there is any change in the text. Super helpful as mentioned earlier in extracting information from standard format of invoice, or order document, etc.


PDF to Excel using power query
output with data in a single row as sentence

 

Thankyou for your patience. Let me know if this helps you anywhere. Happy learning


45 views2 comments

Recent Posts

See All

2 comentarios


nemukesh
11 oct 2022

thanks for the wonderful tutorial. It just saved my day

Me gusta

Invitado
05 jul 2022

nice article. very informative and helpful

Me gusta
bottom of page