lpmd.core.scrape module

class lpmd.core.scrape.BaseScraper(data_id)[source]

Bases: object

Base class for scrapers.

All other scraper classes should inherit from this class. It provides the base methods for scraping data; each scraper subclass should override get_scraped_data.

Parameters
data_id : str

Identifier selecting which data in data_catalogue.yml should be scraped.
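The subclassing pattern described above can be sketched as follows. This is a minimal illustration, not the lpmd implementation: SketchBaseScraper and SketchScraperShipment are hypothetical stand-ins, the data_id value "shipment" and the column names are made up, and a real scraper would download and parse the source instead of building a frame by hand.

```python
import pandas as pd


class SketchBaseScraper:
    """Hypothetical stand-in for BaseScraper (not the lpmd code)."""

    def __init__(self, data_id):
        # Remember which data_catalogue.yml entry this scraper targets.
        self.data_id = data_id

    def get_scraped_data(self, partition_id):
        # Subclasses are expected to override this method.
        raise NotImplementedError


class SketchScraperShipment(SketchBaseScraper):
    """Hypothetical subclass that fixes data_id and customizes scraping."""

    def __init__(self):
        # "shipment" is an assumed data_id, used only for illustration.
        super().__init__("shipment")

    def get_scraped_data(self, partition_id):
        # A real scraper would fetch and parse the source here.
        return pd.DataFrame({"partition_id": [partition_id], "value": [1]})


df_scraped = SketchScraperShipment().get_scraped_data("2020")
print(df_scraped)
```

The subclass takes no constructor arguments because it fixes data_id itself, which matches the signatures of the concrete scraper classes in this module.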

__init__(data_id)[source]
get_scraped_data(partition_id)[source]

Get scraped data corresponding to partition_id.

Parameters
partition_id : str

Identifier selecting which partition defined in data_catalogue.yml should be scraped.

Returns
df_scraped : pandas.core.frame.DataFrame

Scraped data that have not yet been cleansed.

save_scraped_data(partition_id, path=None, **kwargs)[source]

Save scraped data.

Parameters
partition_id : str

Identifier selecting which partition defined in data_catalogue.yml should be scraped.

path : str, default None

String expressing the path to save to. If None, the scraped data is stored in the current working directory.

kwargs

Additional keyword arguments passed to pandas.DataFrame.to_csv.

Returns
has_saved : bool

True if the scraped data was saved successfully, False otherwise.
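The handling of path and the extra keyword arguments can be sketched as below. This is an assumption-laden illustration rather than the lpmd code: save_scraped_data_sketch is a hypothetical helper that receives an already scraped frame, path is treated as a directory, and the file name pattern is invented. Only the forwarding of keyword arguments to pandas.DataFrame.to_csv and the boolean return value come from the description above.

```python
import os
import tempfile

import pandas as pd


def save_scraped_data_sketch(df_scraped, partition_id, path=None, **kwargs):
    """Hypothetical stand-in; the real method scrapes the data itself."""
    # Fall back to the current working directory when no path is given.
    directory = os.getcwd() if path is None else path
    # The "<partition_id>.csv" file name is an assumption for this sketch.
    file_path = os.path.join(directory, f"{partition_id}.csv")
    try:
        # Extra keyword arguments go straight to DataFrame.to_csv.
        df_scraped.to_csv(file_path, **kwargs)
    except OSError:
        return False
    return True


df = pd.DataFrame({"item": ["beef", "pork"], "quantity": [10, 20]})
with tempfile.TemporaryDirectory() as tmp_dir:
    has_saved = save_scraped_data_sketch(
        df, "2020", path=tmp_dir, index=False, sep="\t"
    )
    with open(os.path.join(tmp_dir, "2020.csv"), encoding="utf-8") as f:
        first_line = f.readline().strip()
print(has_saved, first_line)
```

Here index=False and sep="\t" are ordinary to_csv options, so the saved file is tab-separated and has no index column.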

save_batch(path=None, **kwargs)[source]

Save scraped data in batch for the partitions defined in the partition section of data_catalogue.yml.

Parameters
path : str, default None

String expressing the path to save to. If None, the scraped data is stored in the current working directory.

kwargs

Additional keyword arguments passed to pandas.DataFrame.to_csv.

Returns
dict_result : dict

Dict expressing, for each partition_id, whether its data was saved successfully.
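One way to act on the returned dict is sketched below. It assumes the dict maps each partition_id to a bool, which is the most natural reading of the description but is not confirmed here. The partition ids are invented.

```python
# Hypothetical result shaped like the documented return value of save_batch:
# partition_id -> True if that partition was saved, False otherwise.
dict_result = {"2019": True, "2020": False, "2021": True}

# Collect the partitions that were not saved so they can be retried.
failed = [pid for pid, has_saved in dict_result.items() if not has_saved]

if failed:
    print("Partitions to retry:", ", ".join(failed))
else:
    print("All partitions saved.")
```

Each failed partition could then be passed to save_scraped_data individually.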

aggregate()[source]

Aggregate the scraped data for the partitions defined in the partition section of data_catalogue.yml into a single data frame.

Returns
df : pandas.core.frame.DataFrame

Aggregated data frame.
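Conceptually, the aggregation amounts to stacking one frame per partition. The sketch below uses pandas.concat on hand-built frames. How lpmd obtains the per-partition frames and whether it uses concat internally are not stated here, so treat the partition ids, the column names, and the approach itself as assumptions.

```python
import pandas as pd

# Hypothetical per-partition frames, keyed by partition_id.
partitions = {
    "2019": pd.DataFrame({"item": ["beef"], "quantity": [10]}),
    "2020": pd.DataFrame({"item": ["beef", "pork"], "quantity": [12, 20]}),
}

# Stack the partitions row-wise and renumber the index from 0.
df = pd.concat(list(partitions.values()), ignore_index=True)
print(df)
```

Passing ignore_index=True avoids duplicate index labels when every partition starts counting from 0.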

out_to_datasets()[source]

Write a DataFrame in the binary Parquet format to lpmd.datasets.

Returns
file_path : str

Path of the file that is saved.

class lpmd.core.scrape.ScraperShipment[source]

Bases: BaseScraper

Scraper class for data on livestock products shipment.

__init__()[source]
get_scraped_data(partition_id)[source]

Get scraped data corresponding to partition_id for livestock products shipment.

Parameters
partition_id : str

Identifier selecting which partition defined in data_catalogue.yml should be scraped.

Returns
df_scraped : pandas.core.frame.DataFrame

Scraped data that have not yet been cleansed.

class lpmd.core.scrape.ScraperSlaughter[source]

Bases: BaseScraper

Scraper class for data on animals slaughtered and abattoirs.

__init__()[source]
get_scraped_data(partition_id)[source]

Get scraped data corresponding to partition_id for animals slaughtered and abattoirs.

Parameters
partition_id : str

Identifier selecting which partition defined in data_catalogue.yml should be scraped.

Returns
df_scraped : pandas.core.frame.DataFrame

Scraped data that have not yet been cleansed.

class lpmd.core.scrape.ScraperCarcass[source]

Bases: BaseScraper

Scraper class for data on carcass.

__init__()[source]
get_scraped_data(partition_id)[source]

Get scraped data corresponding to partition_id for carcass.

Parameters
partition_id : str

Identifier selecting which partition defined in data_catalogue.yml should be scraped.

Returns
df_scraped : pandas.core.frame.DataFrame

Scraped data that have not yet been cleansed.