Thursday, August 29, 2024

Lab 1: Fundamentals

The fundamentals lab covered accuracy, precision, and bias in spatial data. Accuracy per the lecture is the absence of error and is determined by comparing a coded value with an independent value. Precision is the variance or spread of the values in the dataset. The variance is often determined by standard deviation. The lecture continues with defining bias as the presence of systematic error. 

The lab enforced these concepts with two parts. Part A contained a set of points obtained by a GPS device and the know two point. First task was to obtain the mean point location from the set of way points. With the newly created mean point spatially join it to the way points and include the a distance attribute in meters.  Given this bit it was easy to find the 50%, 68%, and 95% percentile from the median value. The results is shown in the map below.

The remaining section of this lab leveraged the known good location (reference) which we derived metrics like root square mean error (RSME). This section I spent more time then needed because I authored a python script which perform the needed spatial join between the way points and reference point and imported the results, minus the geometry into a pandas DataFrame. There I used numpy to derive the RSME and the Series.describe method for the remaining statistics. 

That bit of prototype code is as follows:

import pandas as pd
import os
import numpy as np
import arcpy


def main() -> None:
    workspace = Path(os.getcwd()).parent
    data_dir = workspace / "Data" / "partb"
    gdb_name = "Lab1PartB.gdb"

    position_shp = data_dir / "positions.shp"
    benchmark_shp = data_dir / "benchmark.shp"

    # it will be a great day when arpy used pathlib
    if not arcpy.Exists(str(data_dir / gdb_name)):
        arcpy.CreateFileGDB_management(str(data_dir), gdb_name)

    with arcpy.EnvManager(workspace=str(workspace), overwriteOutput=True):
        output_feature = str(data_dir / gdb_name / "positions")
        target_feature = str(position_shp)
        join_feature = str(benchmark_shp)
        # Join and import the two shapefiles
        arcpy.analysis.SpatialJoin(
            target_feature, join_feature, output_feature, join_type="KEEP_ALL",
            match_option="CLOSEST", distance_field_name="BenchmarkDistance"
        )
        # create the dataframe (if the data set was extremely large this is
        # probably not the best way to accomplish this).
        fields = ['POINT_X', 'POINT_Y', 'POINT_X_1', 'POINT_Y_1', 'BenchmarkDistance']
        with arcpy.da.SearchCursor(output_feature, fields) as cursor:
            df = pd.DataFrame(data=[data for data in cursor], columns=fields)

        df.rename(columns={"POINT_X": "X", "POINT_Y": "Y", "POINT_X_1": "BenchmarkX", "POINT_Y_1": "BenchmarkY"},
                       inplace=True)

        # add the metrix columns
        df['ErrorX'] = df['X'] - df['BenchmarkX']
        df['ErrorY'] = df['Y'] - df['BenchmarkY']
        df['ErrorXYSqRd'] = df['ErrorX']**2 + df['ErrorY']**2
        df['ErrorXY'] = np.sqrt(df['ErrorXYSqRd'])

        ## describe for they summary stats on ErrorXY column
        print(df['ErrorXY'].describe([0.50, 0.68, 0.95]))
        # get teh RSME
        print('RSME', np.sqrt(df['ErrorXYSqRd'].mean()))
        # The data frame
        print(df)


if __name__ == "__main__":
    main()

And has the following output: 

One of the requirements was to create a Cumulative Distribution Function (CDF) scatter plot. 

 

No comments: