I recently modified a script I wrote to extract data from a Word document to a CSV file. The modified script had to iterate over multiple docs and extract data from specific tables based on certain keywords and fields.
I used the python-docx module to do this, but hit an obstacle when I realised that it could not (as yet) parse Word’s content controls. Since I only had 9 documents, I opened each one and ran some VBA code pilfered from Stack Overflow to remove all of its content controls.
While that worked temporarily, my next step is of course to schedule the script to automatically pull the data out once the folder is updated with the new batch of docs for the month. One suggested solution entails saving the code inside the doc so that it can be called via COM.
I’m not happy with that solution because I would still need to open each document and insert the code. What I need to do now is fiddle around some more so that the code can be saved inside the script and then run on each document as needed.
I wrote this script for a former colleague last year. She had received a Word document containing about 600 tables which must have been dumped out of a database somewhere. The tables had the same header, and each table represented an “incident”, with dates, details etc.
She was asked to “put it into Excel”. After she had manually copied the first table into matching columns in a spreadsheet, she came to me. This type of thing is normally a task we would give to a student, as it has nothing to do with GIS. Nevertheless, when I saw the repeating structure I was sure I could come up with something to do this automatically.
The script finds all the tables in the document, and grabs the header from the first table to serve as the headings in the spreadsheet. It then iterates over all the tables, skips the header row and populates the spreadsheet with all the rows from the various tables.
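The steps above can be sketched roughly as follows. This is not the original script, only a minimal sketch that assumes python-docx for reading the tables and uses a CSV file as a stand-in for the spreadsheet. The function names and file names are hypothetical.

```python
import csv


def read_docx_tables(path):
    """Return every top-level table in a .docx file as a list of rows of cell text."""
    # python-docx is imported here so merge_tables() below works without it.
    from docx import Document

    document = Document(path)
    return [
        [[cell.text.strip() for cell in row.cells] for row in table.rows]
        for table in document.tables
    ]


def merge_tables(tables):
    """Combine tables that share a header into one list of rows.

    The header is taken from the first table; the first row of every
    table is then skipped so the header appears only once.
    """
    if not tables:
        return []
    merged = [tables[0][0]]
    for table in tables:
        merged.extend(table[1:])
    return merged


def write_csv(rows, csv_path):
    """Write the merged rows to a CSV file that Excel can open."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


# Example use (hypothetical file names):
# write_csv(merge_tables(read_docx_tables("incidents.docx")), "incidents.csv")
```

Keeping the merge step separate from the Word and CSV handling means it can be checked on plain lists before pointing it at a 600-table document.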
It took about 15 minutes to write (I had to play around with accessing the table elements correctly) and less than a minute to extract the data. That’s about the time it would have taken to copy 5 of the tables manually. At that pace, copying all 600 or so tables would have taken about 4 days.
It’s been too long since I’ve posted some code. A few months ago I had a requirement to basically create a UML view of a file gdb and present each feature class as a table for inclusion in a functional specification Word document.
Sadly, since the demise of ArcGIS Diagrammer, nothing has come along to take its place. The actual request I received was to take screenshots of the properties dialog box of each feature class in ArcCatalog, and paste those into the document with appropriate headings.
I recognised this request for the total waste of my time it would be, and promptly set about looking for an alternative. I realised I had python-docx already installed.
For each feature class in the geodatabase, a new paragraph is started with the name of the feature class in bold. A line break is inserted, followed by a table. The name and type of each field is added as a new row into the table, and a page break is inserted to start the properties of the next feature class on a new page.
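A rough sketch of that layout is shown below. It is not the original script: it assumes arcpy's `ListFeatureClasses` and `ListFields` for reading the geodatabase and python-docx for writing the document, and the function names and paths are hypothetical.

```python
def field_rows(fields):
    """Turn field objects (anything with .name and .type) into (name, type) pairs."""
    return [(field.name, field.type) for field in fields]


def collect_schema(gdb_path):
    """Map each feature class name in a geodatabase to its (name, type) rows."""
    # arcpy ships with ArcGIS; imported here so the rest runs without it.
    import arcpy

    arcpy.env.workspace = gdb_path
    # Assumption: only feature classes at the top level of the workspace
    # are listed, not those inside feature datasets.
    return {
        fc: field_rows(arcpy.ListFields(fc))
        for fc in arcpy.ListFeatureClasses()
    }


def write_schema_doc(schema, docx_path):
    """Write one bold name, one table and one page break per feature class."""
    from docx import Document  # python-docx

    document = Document()
    for fc_name, rows in schema.items():
        run = document.add_paragraph().add_run(fc_name)
        run.bold = True
        run.add_break()  # line break after the feature class name
        table = document.add_table(rows=0, cols=2)
        for name, field_type in rows:
            cells = table.add_row().cells
            cells[0].text = name
            cells[1].text = field_type
        document.add_page_break()  # next feature class starts on a new page
    document.save(docx_path)


# Example use (hypothetical paths):
# write_schema_doc(collect_schema(r"C:\data\example.gdb"), "schema.docx")
```

Because the document is generated from the geodatabase itself, rerunning the script after a schema change is enough to bring the tables up to date.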
It took me about an hour to look up alternative methods and put this script together. Most of that time was spent getting the cells in the table to insert properly, using the age-old method of trial and error. I did it this way to save myself the pain of manually inserting screenshots, knowing that I would have to redo them every time the format of the feature classes changed.