Currently, most AI-for-ecology efforts focus on processing or labeling raw data streams one modality at a time, e.g., detecting and categorizing species from bioacoustics, camera traps, aerial imagery, or eDNA. However, these data streams do not exist in a vacuum: data collection efforts across modalities are often co-located and complementary. In this talk, I will discuss ongoing efforts to build systems that improve data categorization by making use of spatially co-located, but sometimes heterogeneously sampled, data modalities. I will present a detailed case study on automating tree censuses in cities from aerial and street-level imagery, and I will also discuss the potential of "bycatch" information encoded in the backgrounds of images in large-scale data repositories like iNaturalist and Wildlife Insights.