Tired of DOH case line data going down? Use this script to pull the data yourself

Hours after I published the new Florida COVID Community Dashboard last weekend, DOH again pulled its data offline. Two days later, it’s offline again. And it’ll probably go down tomorrow or the next day. It’s ridiculously easy to update this dataset, so the only possible reasons for its repeated crashing and restricted access are gross incompetence or deliberate malfeasance.

Never fear, for R scripting is here!

Here’s a quick-and-easy R script that allows you to extract the case-line data from the daily PDF report published to the DOH website, so you no longer need to depend on the DOH API for that data.
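The core idea is simple: the PDF text comes back as one string per page, and only the pages that mention the line list are kept. Here is a toy sketch of that filtering step in base R. The `pages` vector is made-up text standing in for real PDF output, and the names `is_case_page` and `case_pages` are just for illustration.

```r
# Toy stand-in for PDF text output: one string per page (made-up text).
pages <- c(
  "Florida COVID-19 summary\nTotal cases by county",
  "Coronavirus: line list of cases\nCase County Age Gender",
  "Coronavirus: line list of cases\nCase County Age Gender",
  "Testing by laboratory\nLab name and counts"
)

# Flag the pages that contain the line list, then get their page numbers.
is_case_page <- grepl("line list of cases", pages, fixed = TRUE)
case_pages <- which(is_case_page)

print(case_pages)
```

Only the flagged pages need to be parsed, which keeps the rest of the report (summaries, lab tables) out of the way.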

While it is not as detailed as the case data available through the DOH-AGO API (it lacks hospitalization and death data), I’m working on a longer script that adds those variables from the other PDF reports DOH publishes. There are still some technical issues with columns lining up, but I’ll update the script here once I get the bugs fixed on other applications I’m developing.

Add your own folder location in the “YOURFOLDER\\LOCATION\\HERE” part below, then you’re good to go!

FYI, if they try to “get sneaky” about hiding data, here’s the link to a list of all of DOH’s REST services.

# pdftools reads the PDF text; stringr handles the string splitting
library(pdftools);library(stringr)

# Latest state report PDF on the DOH website
url <- 'http://ww11.doh.state.fl.us/comm/_partners/action/report_archive/state/state_reports_latest.pdf'
# pdf_text() returns one character string per page
text1=pdf_text(url)


# Find the pages that hold the case line list
case_x=grepl("line list of cases",text1)
case_tables=which(case_x)
mm=length(case_tables)

case_data=NULL
for(i in seq_len(mm)){

	# Split the page into lines (handles both \r\n and \n) and drop the first 6 lines
	ttt <- t(str_split(text1[case_tables[i]], "\r?\n", simplify = TRUE))[-(1:6),]

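	# Split the first remaining line into single characters and find the first
	# letter, which marks where the County column starts
	t2=strsplit(ttt[[1]],"")[[1]]
	t_a=c(regexpr("[[:alpha:]]", t2))

	county_start=which(t_a==1)[1]
	lx=length(t2)
	# Column widths, turned into cut points; the last three cut points are
	# measured back from the end of the line
	cwidths=c(county_start-1,13,5,8,9,15,13,8)
	sxx=c(1,cumsum(cwidths),lx)
	sxx[7]=lx-35;sxx[8]=lx-20;sxx[9]=lx-7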

	# Cut every line but the last into columns and trim the padding
	nr=length(ttt)-1
	case_data0=NULL
		for(j in seq_len(nr)){

			rx=t(substring(ttt[[j]],sxx[-(length(sxx))],sxx[-c(1)]-1))
			rx2=trimws(rx, which ="both")
			case_data0=rbind(case_data0,rx2)
		}	

case_data=rbind(case_data,case_data0)
}

# Drop rows with an empty first column (safe even when there are none)
case_data_df=data.frame(case_data[case_data[,1]!="",,drop=FALSE])
names(case_data_df)=c("Case","County","Age","Gender","Travel","Origin","Contact","Jurisdiction","Case_Date")

# Save inside your folder under a dated file name such as 2020_05_20.csv
save_path=file.path("C:\\YOURFOLDER\\LOCATION\\HERE",paste0(format(Sys.Date(),"%Y_%m_%d"),".csv"))
write.csv(case_data_df,save_path,row.names=FALSE)
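If the column-cutting part of the script looks cryptic, here is a stripped-down sketch of the same fixed-width idea in base R. The line of text, the widths, and the names `line`, `widths`, `breaks`, and `fields` are all made up for illustration; this is not real DOH data, and the real report needs the adjusted cut points used in the script above.

```r
# Build a made-up fixed-width line: four fields padded to widths 8, 13, 5, 8.
widths <- c(8, 13, 5, 8)
line <- sprintf("%-8s%-13s%-5s%-8s", "   12", "Dade", "34", "Male")

# Cut points: the start of each field, plus one position past the end.
breaks <- c(1, cumsum(widths) + 1)

# substring() is vectorized over start/stop, so one call slices every field.
fields <- substring(line, breaks[-length(breaks)], breaks[-1] - 1)

# Trim the padding left over from the fixed-width layout.
fields <- trimws(fields, which = "both")

print(fields)
```

The script does the same thing row by row, then stacks the trimmed rows with `rbind()`. If DOH shifts a column in the PDF, the widths are the first place to look.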



