Predicting Fertilizer Input for Rice Cultivation in India

AI for Agriculture

Geospatial Data Science

Traditional Machine Learning

Home to over 1.38 billion people, India is tackling a severe hunger crisis. Though the country has achieved self-sufficiency in grain production, nearly 14% of the population is still undernourished. India’s agricultural landscape is primarily rural, where widespread poverty, low literacy rates, and poor infrastructure lead to questions over its sustainability. Indiscriminate use of fertilizers has led to significant irregularity in crop production despite consistent agricultural subsidies. With the current global shortage of fertilizers, precision farming is vital to eliminate redundant costs and streamline resources to ensure equitable food access for all communities. Here, we assist policy-makers in their decisions through models predicting the fertilizer consumption (nitrogen, phosphorus, and potash) required to obtain specific rice yields in different cultivation environments.

Author

Akshay Suresh

Published

June 15, 2022

Note

Project team members: Akshay Suresh (lead), Arman Darbinyan, Dmitry Shcherbakov, Emilio Codecido & Leonardo Santana
Mentor: James Bramante
Github repo: may22-barrel
5-minutes video presentation: YouTube link
Presentation slides: PDF
Executive summary: PDF
Programming language: Python (numpy, pandas, scikit-learn, geopandas, matplotlib seaborn)
Supervised machine learning (ML) techniques: Linear regression, random forest, support vector machine
Unsupervised ML algorithms: \(k\)-means clustering, hierarchical clustering

Project Context

This project was completed as part of the Erdös Institute Data Science Bootcamp, Spring 2022. Within a hard 2-week deadline, team members were required to define their project goal, identify target stakeholders, gather data, and execute data analyses. The following deliverables were due at the end of the 2-week project window.

1-page executive summary
5-minutes video presentation
Presentation slides
Annotated Github repository

Links to our submitted deliverables are provided in a note near the top of this page.

Project Workflow

Project workflow

Project workflow diagram

Objective: Predict mean NPK (nitrogen, phosophorous and potash) fertilizer inputs to achieve specific rice yields in different cultivation environments across India.

Target stakeholders: Agriculture policy makers in India

Data sources:

District-level database (1990–2016) maintained by the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT):
- Rice cultivation data: cropped area, yield, irrigated area
- Seasonality and temperature
- Water cycle data: precipitation, surface runoff, and evapotranspiration
- Wind speed
- Historical NPK fertilizer usage
Shapefile of Indian districts from Kaggle

Modeling and Results

Rice is a hardy crop capable of thriving in a variety of soils, including loams, silts, and gravel. Collating up to 26 years of district-level rice cultivation (cropped area, yield, irrigated area) and environment data (temperature, precipitation, wind speed, evapotranspiration), our analysis involved three steps.

Firstly, we grouped districts with similar ecological parameters into clusters. To do so, we experimented with two unsupervised learning approaches, namely, \(k\)-means and hierarchical clustering. Both methods favored the grouping of Indian districts into 6 rice cultivation clusters.

Clustering results

Optimal clustering of Indian districts based on application of hierarchical clustering to our environmental data (temperature, precipitation, wind speed, evapotranspiration). Note that the above map bears some visual resemblance to the Koppen-Geiger climate classification map of India. However, we caution readers against performing meticulous comparisons between these maps as our algorithms additionally incorporate soil-dependent features such as surface runoff and evapotranspiration.

Over independent clusters, we regressed the historical NPK consumption data against rice yield. Here, we trialed simple linear regression, random forest regression, and support vector regression.

Results from application of support vector regression to cluster 1 data. Our fit residuals to the potash data (bottom row) look reasonably flat and structure-free. However, the same cannot be said for our nitrogen (top row) and phosphorous (middle row) fit residuals, possibly hinting at either some confounding variable or some parameter cross-correlations unaccounted for in our analyses.

The symmetric mean absolute percent error (SMAPE) offers a convenient difference-based relative measure for comparing model performances across clusters with unequal numbers of data points.

\[ {\rm SMAPE} = \left( \frac{100\%}{N_{\rm observations}} \right) \sum_{\rm observations} \left(\frac{|{\rm True \ value} - {\rm Predicted \ value}|}{(|{\rm True \ value}| + |{\rm Predicted \ value}|)/2} \right) \]

For a perfect model (true value = predicted value), \({\rm SMAPE} = 0\%\). Meanwhile, a null prediction (predicted value = 0) entails \({\rm SMAPE} = 200\%\).

SMAPE results

Studying the above SMAPE barplots, we note that support vector regression marginally outperforms other regression models. However, in general, our regression models are not too accurate, yielding median SMAPE values of about 50% for N and P, and nearly 75% for K.

Areas for Improvement

Consider incorporating data on soil nutrient content, soil type and texture, and solar irradiance to improve the accuracy of model fits.
Assess the impact of crop rotation and off-season farming practices on the sustainability of a desired crop yield.