GDAL/OGR data processing workflows
I was recently tasked with analyzing metrics - namely feature counts - for many datasets from an Indian data provider. The dataset was provided in shapefile format and organized into folders for each state and territory, of which there are 29 states and 7 territories in India. The shapefiles had descriptive names which were identically named in each state’s/territory’s folder. See the following example:
Gujarat\ Highways.shp Rivers.shp Settlements.shp UttarPradesh\ Highways.shp Rivers.shp Settlements.shp
I wanted an efficient way to analyze the coverage (area for polygon features, distance for linear features, feature count for all geometries) and decided that importing this data into PostGIS would make this task simpler.
- write out the # of features per shapefile to Excel
- import data into a PostGIS database
- compare feature count metrics for original shapefile and imported features in PostGIS database.
Step 1: Generated initial metrics using ogrinfo, findstr, and echo:
Ogrinfo is a command-line tool from OSGeo GDAL libraries. Combined in use with echo and findstr - both built-in Windows command-line tools - I was able write the file path, the layer name, and feature count to a text file using the following command. This command recursively runs the ogrinfo tool - which provides basic metadata on OGR-compliant spatial data. Findstr allowed filtering of the output metadata to filter out the layer name and feature count:
FOR /R %f IN (*.shp) do ( echo %f >> shape_list.txt ogrinfo -al -so %f| findstr /I /c:"layer name" /c:"feature count" >> shape_list.txt )
The output was a very long text file, which was copied into Excel for further manipulation. For each shapefile, the following rows were written to the text file:
- File Path:
- Layer Name:
- Feature Count:
After copying this text file into Excel, I used the TRANSPOSE function to reorient this data from 1 row per statement (3 rows per shapefile) to one row with attributes written to three columns. After each row contained the initial variables only: path, file name, feature count, I used PivotTables to summarize the data on the Layer Name column.